
Tuesday, August 26, 2014

An Achievement on Stack Overflow

Last Friday (August 22nd), I achieved a milestone on Stack Overflow, a computing question-and-answer site, by hitting 200,000 points.  So far this year, I have also earned more points than anyone else, a testament to my obsession with the site (you can see yearly user rankings here).  As for overall points, I will never match the leader, Jon Skeet, who had a head start of many years.

My answers relate almost exclusively to SQL and various databases.  The site is highly geared toward "tools" questions, so there are few general analysis questions.

So, this post shares some of my thoughts and my history on the site.

Clearly, I have helped a lot of people on Stack Overflow, around the world.  The most rewarding part is the thank-yous from real people working on real problems.  On several occasions, I have helped speed up code by more than 99% -- turning hours of drudgery into seconds or minutes of response time.

But answering questions has helped me too:
  • My technical knowledge of databases has greatly improved, particularly the peculiarities (strengths and weaknesses) of each database engine.
  • I have learned patience for people who are confused by concepts in SQL.
  • I have (hopefully) learned how to explain concepts to people with different levels of competence.
  • I have learned how to figure out (sometimes) what someone is really asking.
  • I have a strong appreciation for what SQL can do and what SQL cannot do.
  • It has definitely increased the number of hits when I egosurf.
A few months after starting, I stopped down voting questions and answers.  "Down voting" is usually seen as "not-nice", making the other person defensive, confused, and perhaps angry.   A lesson for real life:  being mean (by down voting) is not nearly so useful as offering constructive criticism (by commenting).


This all started in January 2012 (a bit over two and a half years ago).  The reason was simple:  I was writing a system called The Netting Machine for the Lehman Brothers Estate and it was stretching my knowledge of SQL Server.  One particular problem involved dynamic queries.  Google kept directing me to the same question on Stack Overflow.  The best answer was close to what I needed, but not quite; it was only halfway there.  The third time I landed on the page, I added my own answer.  This was actually for quite selfish reasons:  the next time Google took me there, I wanted to see the full answer.

Lo and behold, my answer was accepted and up voted.  When the OP ("original poster" -- Stack Overflow lingo for the person asking the question) accepted my answer, s/he had to unaccept another.  That answer was by Aaron Bertrand, a SQL Server guru whose name I recognized from his instructive blog posts.  Aaron commented about the "unaccept".  In the back of my mind, I thought:  if Aaron thinks this is important, then there must be something to it.  Ultimately, I can blame Aaron (whom I have not yet met in person) for getting me hooked.

For a few months, I sporadically answered questions.  Then, in the first week of May, my Mom's younger brother passed away.  That meant lots of time hanging around family, planning the funeral, and the like.  Answering questions on Stack Overflow turned out to be a good way to get away from things.  So, I became more intent.

Stack Overflow draws you in not only with points but with badges and privileges.  Each time I logged in, the system "thanked" me for my participation with more points, more badges, and more privileges.   This continued.  One day (probably in June), I hit the daily maximum of 200 points from upvotes (you also get points when an answer is accepted or someone offers a bounty).  One week, I hit 1,000 points.  One month, 5,000 points.  As an individual who is mesmerized by numbers, I noticed these things.

Last year, I hit 100,000 points in September and slowed down.  I figured that six figures was enough, and I had other, more interesting things to do -- a trip to Myanmar, my sister's wedding, our apartment in Miami, classes to teach (San Francisco, Amsterdam, Chicago, San Antonio, Orlando) and so on.

I didn't start 2014 with the intention of spending too much of my time on the site.  But three things happened in January.  The first was a fever with a rash on my face.  It kept me home with not-enough to do.  So, I answered questions on Stack Overflow.  Then, I had an attack of gout.  That kept me home with not-enough to do.  And finally, the weather in January in New York was, well, wintery -- lots of cold and lots of snow.  More reasons to stay home and answer questions.

By the end of January, I was the top scorer for the month.  "Hey, if I can do it in January, let's see what happens in February."  I had thought of relenting in April:   we flew to Greece and spent two nights on Mount Athos.  Mount Athos is a peninsula in northern Greece, devoted to twenty Orthodox monasteries -- and nothing else.  It is inhabited by a few thousand monks living medieval lifestyles.  The only way to visit is as a "pilgrim", staying at a monastery.  An incredible experience.  No internet.  But, I was able to make up the point deficit on Stack Overflow.

This year, each month that passes is another month where I seem to be the top point-gatherer on Stack Overflow.  At this point, I might as well make it to the end of the year.  I don't know if I will, but I do hope to help a few other people and to learn more about databases and how people are using them.




Sunday, May 18, 2014

Armed Bandits: A Statistical Approach

This is a continuation of my previous post on multi-armed bandits.  And, I'm guessing there will be at least one more after this.

The Multi-Armed Bandit problem is a seemingly simple problem.  A gambler is faced with a row of slot machines, each of which returns different winnings.  S/he needs to devise a strategy to find the winningest slot machine as quickly as possible and then just play that one.

Most of the strategies for doing this are based on a greedy-algorithm approach.  They are some variation on:  randomly (or in round-robin fashion) choose slot machines until some threshold has been reached.  Then continue playing the winningest one.  These actually work pretty well.  But I am interested in applying basic statistics to the problem.

Before doing that, let me explain why I am interested.  Imagine that I have a web site and I have an ad space to fill.  Here are different things I might put there:

  • A run-of-network ad that will make some amount of money per impression.
  • A click-through ad that will make some amount of money if someone clicks on it.
  • A partner ad that will make some amount of money if someone signs up for something.
The Multi-Armed Bandit provides an automated means of testing all three of these at once, along with variations that may, or may not, prove better than business-as-usual.   I think of it as automated champion-challenger models.

Here is a "statistical" approach to this problem.  Let me assume that there are N campaigns being run.   Each campaign has a payout distribution.  I can calculate the average payout for each campaign.  In the end, I want to choose the campaign that has the largest average payout.  Note that I'm make assumptions here that the the campaigns perform consistently across time and across the visitor population.  Those are other issues I discussed earlier.  Let's focus on the basic problem here.


By the Central Limit Theorem, we know that we can estimate the average based on a sample of data.  This estimate of the average has its own average and standard deviation (the standard error), which (once there are enough samples) gets smaller and smaller, meaning that the estimate gets better and better.

The idea is then simple.  At the beginning, give each campaign the same estimate with a wide confidence interval.  The intervals all overlap completely, so the choice of best campaign is random.  Initially, we might want to round-robin the data to get some initial values.  Relatively quickly, though, we should get estimates for each of the campaigns; these will be inaccurate, but their wide confidence intervals will reflect that.

At each iteration, we need to update the average and standard deviation.  Fortunately, there are easy incremental algorithms for both, so all the historical data does not need to be saved.  This article discusses various algorithms for calculating variance, and hence standard deviation.
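
As a sketch of the incremental idea (Welford's method), assume a hypothetical campaign_stats table with columns n, mean, and m2 (the running sum of squared deviations).  A new payout can then be folded in with a single UPDATE; the right-hand sides all see the pre-update column values, which is exactly what the recurrence needs:

-- Hypothetical table campaign_stats(campaign_id, n, mean, m2); the literal
-- 2.50 stands in for a newly observed payout on campaign 7.
update campaign_stats
set n    = n + 1,
    mean = mean + (2.50 - mean) / (n + 1),
    -- m2 grows by delta * (x - new_mean), using the pre-update mean and n
    m2   = m2 + (2.50 - mean) * (2.50 - (mean + (2.50 - mean) / (n + 1)))
where campaign_id = 7;

The sample variance is then m2 / (n - 1), and the standard error of the estimated average is sqrt(m2 / (n - 1) / n), so no payout history needs to be kept.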

The question is:  if we have multiple averages and standard errors, how do we choose the appropriate campaign at each step?  We can run a fast simulation to get the best campaign.  For each campaign, generate a random number based on the estimated average and standard error.  Choose the campaign that has the largest number.
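
Here is a minimal sketch of that selection step (Postgres-style SQL), assuming the same hypothetical campaign_stats table as above; the normal draw uses the Box-Muller transform on random(), and the standard error is reconstructed from n and m2:

select campaign_id
from (select campaign_id,
             mean + sqrt(m2 / (n - 1) / n) *
                    sqrt(-2 * ln(1.0 - random())) * cos(2 * pi() * random()) as draw
      from campaign_stats
      where n > 1
     ) draws
order by draw desc
limit 1;

Each campaign gets one draw from a normal distribution centered on its estimated average with spread equal to its standard error; whichever campaign wins the draw serves the next impression.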

What happens over time is that the campaign with the best payout should become more and more confident, as well as having the highest average.   Its confidence interval will shift away from the others, further increasing the odds of that campaign being chosen.  This is a positive feedback mechanism.  Note that I am using the term "confidence interval" as an aid to visualizing what is happening; this method is not actually using any p-values generated from the confidence interval.

One nice feature about this method is that it can adapt to the chosen solution getting worse.  If so, the average will decrease (but not the standard error) and other campaigns might be chosen.  Getting this to work involves a bit more effort, because you probably want to keep the sample size fixed -- otherwise the learning rate would be too small.

A note about distributions.  This solution depends on the distribution of the sample average, not the distribution of the original payout.  The sample average should (in the limit) have a normal distribution, characterized by the average and standard error.  This is not a statement about the original data distribution, only about the average.  And, in the end, we want to choose the campaign that has the best average.  This is handy, because the three examples that I gave earlier are very different.  One has a constant (but low) payout and the other two are biased toward zero payouts.

I do believe that this method will produce reasonable results in practice.  However, it does bring up subtle issues about how the underlying distributions of the payouts affect the averages.  On the surface, it seems pretty sound, and it should work pretty well in practice.


Sunday, March 30, 2014

Doing the Right Thing: Are your measures correct?

"A lot of good analysis is wasted doing the wrong thing."
Anyone who has worked with data on business problems is probably aware of this adage.  And this past week, I was reminded once again of this fact while analyzing a marketing program.  This example is so striking because the difference between doing the "right" thing and the "almost-right" thing ended up being more than a factor of 10 -- a really big variance on a financial calculation.

Some background.  One of my clients does a lot of prospecting on the web.  They have various campaigns to increase leads to their web site.  These campaigns cost money.  Is it worth it to invest in a particular program?

This seems easy enough to answer, assuming the incoming leads are coded with their source (and they seem to be).  Just look at the leads coming in.  Compare them to the customers who sign up.  And the rest, as they say, is just arithmetic.

Let's say that a customer who signs up on the web has an estimated value of $300.  And, we can all agree on this number because it is the Finance Number.  No need to argue with that.

The first estimate for the number of leads brought in was around 160, produced by the Business Intelligence Group.  With an estimated value of $300, the pilot program was generating long term revenue of $48,000 -- much more than the cost of the program.  No brainer here.  The program worked! Expand the program!  Promote the manager!

The second estimate for the number of leads brought in was 12.  With an estimated value of $300, the pilot was generating $3,600 in long term revenue -- way less than the cost of the program.  Well, we might as well burn the cash and roast marshmallows over the flame.  No promotion here.  Know any good recruiters?

Both these estimates used the same data sources.  The difference was in the understanding of how the "visitor experience" is represented in the data.

For instance, a visitor has come to the site 300 times in the past.  The 301st visit was through the new marketing program.  Then two weeks later on the 320th visit, magic happens and the visitor becomes a customer.  Is the lead responsible for the acquisition?  This problem is called channel attribution.  If the customer had signed up when s/he clicked as a lead then yes, you could attribute all or most value to that marketing program.  But two weeks and 20 visits later?  Not likely.  The lead was already interested.

A more serious problem occurs through the complexities of web visits.  If a visitor is not logged in, there is no perfect way to track him or her (or "it" if it were a dog).  Of course, this company uses cookies and browser caches and tries really, really hard to keep track of visitors over time.  But the visitor cannot be identified as a customer until s/he has logged in.  So, I may be a real customer, but happen to be trying out a new browser on my machine.  Or, I visit from an airport lounge and don't log in.  Or some other anonymous visit.  This seems like a bona fide lead when arriving through the marketing program.

And then . . .  the visitor keeps using the new browser (or whatever).  And then later, s/he decides to login.  At that point, the visitor is identified as a customer.  And, more importantly, the VisitorId associated with the visitor is now a customer.  But that doesn't mean that the lead created the customer.  The logging in merely identified an existing customer.

Guess what?  This happened more times than you might imagine.  In many, many cases, the 160 "customers" generated by the leads had been customers for months and years prior to this marketing campaign.  It doesn't make sense to attribute their value to the campaign.

The moral of this story:  it is important to understand the data and more importantly, to understand what the data is telling you about the real world.  Sometimes in our eagerness to get answers we might miss very important details.

As a final note, we found the problem through a very simple request.  Instead of just believing the number 160 in the report generated by the Business Intelligence Group, we insisted on the list of leads and account numbers created by the program.  With the list in-hand, the problems were fairly obvious.


Tuesday, March 25, 2014

Three SQL Constructs You Can Forget About

SQL is a very powerful language, which could, of course, be made even more powerful and useful.  This post discusses three features of the language -- ANSI standard features -- that seem not only unnecessary but downright detrimental.  That is, they seem to cause much more confusion than they provide in functionality.  And, in all these cases, it would be easy to work around their absence.

Although it would be nice to remove these from the language itself, that is unlikely to happen.  However, they can be de-prioritized for people learning SQL.  These constructs are easy to work around and are less functional than their alternatives.  When learning SQL, these should be learned later in the process.

(1)  INSERT . . . VALUES()

The first construct is the use of VALUES with INSERT, as in:

insert into t(col1)
    values (1);

In almost every database, this is easily replaced with:

insert into t(col1)
    select 1;
In some databases, you might have to add a from dual or from sys.dummy to make this work.
And, in every respect except one, the INSERT . . . SELECT method is better.   For instance, you can add a WHERE clause to be sure that the value doesn't already exist:
insert into t(col1)
    select 1
    where not exists (select 1 from t t2 where t2.col1 = 1);

Or, you can readily add other values, from this or another table:
insert into t(col1, col2)
    select 1, (select count(*) from t2);
Trying to fit this into a VALUES statement just causes syntax errors and confusion.

And, you can use UNION ALL to add multiple rows at the same time.
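
For example (a quick sketch, reusing the table t from above with made-up values; add a from dual where your database requires it):

insert into t(col1, col2)
    select 1, 'a' union all
    select 2, 'b' union all
    select 3, 'c';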

The VALUES statement has exactly one advantage and that is the fact that it is standard.  The same code will work in multiple databases.  That seems very minor.  It would be better if the standard had a way of using SELECT to return a row without a table.


(2)  SELECT DISTINCT

The next unnecessary construct is SELECT DISTINCT.   First, this is easily replaced with GROUP BY.  So:
select distinct a, b, c
from t;
is the same as:
select a, b, c
from t
group by a, b, c;

What makes the GROUP BY better?   Primarily the fact that you can have a HAVING clause.

Also, SELECT DISTINCT is sometimes understood to mean "get me all the rows that are distinct", rather than "get me the distinct values from all the rows."  Actually, that first interpretation makes a lot of sense, even if it is wrong.  Not only is there no danger of such confusion with GROUP BY, but adding HAVING COUNT(*) = 1 actually answers the first question (see the example below).  There is no way to do that with SELECT DISTINCT.
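
For example, a minimal sketch of that first interpretation -- the rows that appear exactly once -- using the same table t:

select a, b, c
from t
group by a, b, c
having count(*) = 1;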

The second problem is perhaps more dangerous.  Have you ever seen someone write this?
select distinct(a), b, c
from t;
Here, the DISTINCT seems to be used like a function.   The intention is "Get me distinct values of a along with arbitrary values of b and c".  Of course, this is exactly the same with or without the parentheses.  DISTINCT is not a function.  This usage is so prevalent that Postgres introduced the DISTINCT ON syntax to support it.

What advantages does SELECT DISTINCT have?  The syntax is shorter and you don't have to repeat the column names in a GROUP BY clause.    In a world of cut-and-paste, copying the columns to GROUP BY is negligible effort.   And, it does allow SELECT DISTINCT *.   However, that is a construct that I wouldn't miss at all.


(3)  COUNT(column)

Finally, there is the COUNT aggregation function with a column as an argument.  Just to be clear, I have no problem with COUNT(DISTINCT column) or COUNT(*) or COUNT(1).

No doubt, the designers of SQL were obsessed with NULL values (and despite the obsession, they still didn't get it right).   Wouldn't everyone in the world (who uses SQL) want to count the number of non-NULL values in a column?  What else could COUNT(column) mean?

Well, in many contexts, people probably think it means COUNT(DISTINCT column).  Consider the following query:
select c.country, count(c.CustomerId), count(o.OrderId)
from Customers c join
     Orders o
     on c.CustomerId = o.CustomerId
group by c.country;

Many people might write this code, just like this, with the intention of getting the number of customers and the number of orders in each country.  How sad when they learn that these are the same!  There are no repeat purchasers anywhere.  (COUNT(DISTINCT c.CustomerId) fixes this problem.)

Without COUNT(column), such confusion would be a non-issue.

And, if you wanted to count non-NULL values?  Why not do it explicitly, so you can remember what the query is supposed to be doing:
select sum(case when a is not null then 1 else 0 end)
Yes, this takes a bit more typing, but the query is much clearer about what it is doing.  It would be much shorter if all databases supported the "boolean is an integer" shortcut:
select sum(a is not null)

(4) ,

What is a list of three things without a fourth to cap it off?  Just don't use a comma in the FROM clause.  Explicit join syntax is more expressive and clearer in every case.  The , can be replaced by CROSS JOIN.

Thursday, March 20, 2014

Big Data and SQL

I happen to think that SQL is a very viable option for analyzing big data.  I was thinking about this when I read a book review recently:
For instance, Siegel reports, people who buy small felt pads that adhere to the bottom of chair legs (to protect the floor) are more likely than others to be good credit risks.
For some people, results like this conjure up magic.  PhDs in white coats bustling around, surrounded by acres of machines humming away pondering this imponderable problem (or is that the air conditioning making the noise?).  In fact, something like this is readily calculated from a normal decision support database containing historical data.

So, how hard is it to write the SQL?

The place to start is to rephrase the question.  Let's ask it as:
For all products purchased by customers in 2013, what is the non-payment rate for the first three months of 2014?
Note that this is carefully phrased as a "before" and "after" problem.  Although that does not guarantee causality, it does help.

Next, assume that we have the following tables:

  • Customers
  • Orders
  • OrderProducts
  • Invoices (monthly, with a flag to indicate non-payment)
The following query summarizes all the products purchased in 2013:

select op.ProductId, count(*) as NumProducts,
       count(distinct o.CustomerId) as NumCustomers
from Orders o join
     OrderProducts op
     on o.OrderId = op.OrderId
where o.OrderDate >= '2013-01-01' and
      o.OrderDate < '2014-01-01'
group by op.ProductId;


The following gets all customers who didn't pay in the first three months of 2014.  This might look something like:

select i.CustomerId
from Invoices i
where i.InvoiceDate >= '2014-01-01' and
      i.InvoiceDate < '2014-04-01' and
      i.NotPaid = 1;

These can then easily be combined to get a list of products, by the proportion of customers who did not pay:

select pc.ProductId, count(*) as NumCustomers,
       count(np.CustomerId) as NumNotPaid,
       count(np.CustomerId)*1.0 / count(*) as NonPayRate
from (select op.ProductId, op.CustomerId
      from Orders o join
           OrderProducts op
           on o.OrderId = op.OrderId
      where o.OrderDate >= '2013-01-01' and
            o.OrderDate < '2014-01-01'
      group by op.ProductId, op.CustomerId

     ) pc left outer join
     (select i.CustomerId
      from Invoices i
      where i.InvoiceDate >= '2014-01-01' and
            i.InvoiceDate < '2014-04-01' and
            i.NotPaid = 1
     ) np
     on pc.CustomerId = np.CustomerId
group by pc.ProductId
order by NonPayRate desc;

This isn't particularly complex SQL.  Instead, we can think about what is really important.  The first is being willing to ask the question.  I think a major constraint in business is that managers and executives are hesitant to ask questions.  They don't have a sense of what is "easy" to answer and what is "hard".  They also fear getting different answers from different people.

The second is the interpretation.  The statement that people who want to protect their furniture are better credit risks has a nice warm and fuzzy quality:  people who care about their belongings also care about their credit.  Perhaps other factors are at work.  People buy new furniture and want to protect it because they have access to cash or credit -- they may simply be richer than other people, at least for a period of time.  Or, felt pads may only be sold in areas where people tend to own their homes, so there is a store bias in the merchandising.  Or, customers who buy these small items may be paying in cash and never make larger purchases that might measure credit risk.

To understand what is really happening would require further analysis.  To get started just takes asking some insightful questions.






Tuesday, March 11, 2014

Heuristics in Analytics

Last week, a book -- a real, hard-cover paper-paged book -- arrived in the mail with the title:  Heuristics in Analytics:  A Practical Perspective of What Influences Our Analytic World.  The book wasn't a total surprise, because I had read some of the drafts a few months ago.  One of the authors, Fiona McNeill, is an old friend, and the other, Carlos, is a newer friend.

What impressed me about the book is its focus on the heuristic (understanding) side of analytics rather than the algorithmic or mathematical side of the subject.  Many books that attempt to avoid technical detail end up resembling political sound-bites:  any substance is as lost as the figures in a Jackson Pollock painting.  You can peel away the layers, and still nothing shows up except, eventually, a blank canvas.

A key part of their approach is putting analytics in the right context.  Their case studies do a good job of explaining how the modeling process fits into the business process.  So, a case study on collections discusses different models that might be used, answering questions such as:

  • How long until someone will pay a delinquent bill?
  • How much money can likely be recovered?
This particular example goes through multiple steps around the business process, including financial calculations on how much the modeling is actually worth.  It also goes through multiple types of models, such as a segmentation model (based on Kohonen networks) and the differences -- from the business perspective -- of the different segments.  Baked into the discussion is how to use such models and how to interpret the results.

In such a fashion, the book covers most of the common data mining techniques, along with special chapters devoted to graph analysis.  This is particularly timely, because graphs are a very good way to express relationships in the real world.

I do wish that the data used for some of the examples in the book were available.  They would make some very interesting examples.


Saturday, March 1, 2014

Lines and Circles and Logistic Regression

Euclidean geometry, formalized in Euclid's Elements about 2,300 years ago, is in many ways a study of lines and circles.  One might think that after more than two millennia, we would have moved beyond such basic shapes, particularly in a realm such as data mining.  I don't think that is so.

One of the overlooked aspects of logistic regression is how it is fundamentally looking for a line (or a plane or a hyperplane in multiple dimensions).  When most people learn about logistic regression, they start with an understanding of the sinuous curve associated with it (you can check out the Wikipedia page, for instance).  Something like this in one dimension:


Or like this in two dimensions:



These types of pictures suggest that logistic regression is sinuous and curvaceous.  They are actually misleading.  Although the curve is sinuous and curvaceous, what is important is the boundary between the high values and the low values.  This separation boundary is typically a line or hyperplane; it is where the value of the logistic regression is 50%.  Or, assuming that the form of the regression is:

logit(p) = f(x) = a*x + b

Then the boundary is where f(x) = 0.  What does this look like?   A logistic regression divides the space into two parts, one part to the "left" (or "above") the line/hyperplane and one part to the "right" (or "below").  A given line just splits the plane into two parts:


In this case, the light grey would be "0" (that is, less than 50%) and the blue "1" (that is, greater than 50%).  The boundary is where the logistic function takes on the value 50%.
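
To make the 50% boundary concrete, here is a small sketch with made-up coefficients (a = 0.8, b = -2.0) and a hypothetical table called inputs.  The predicted class flips exactly where 0.8 * x - 2.0 = 0, that is, at x = 2.5, which is also where the predicted probability crosses 50%:

select x,
       1.0 / (1.0 + exp(-(0.8 * x - 2.0))) as p,
       case when 0.8 * x - 2.0 >= 0 then 1 else 0 end as predicted_class
from inputs;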

Note that this is true even when you build a logistic regression on sparse data.  For instance, if your original data has about 5% 1s and 95% 0s, the average value of the resulting model on the input data will be about 5%.  However, somewhere in the input space, the logistic regression will take on the value of 50%, even if there is no data there.  Even if the interpretation of a point in that area of the data space is non-sensical (the customer spends a million dollars a year and hasn't made a purchase in 270 years, or whatever).  The line does exist, separating the 0s from the 1s, even when all the data is on one side of that line.

What difference does this make?  Logistic regression models are often very powerful.  More advanced techniques, such as decision trees, neural networks, and support-vector machines, offer incremental improvement, and often not very much.  And often, that improvement can be baked back into a logistic regression model by adding one or more derived variables.

What is happening is that the input variables (dimensions) for the logistic regression are chosen very carefully.  In a real situation (as opposed to the models one might build in a class), much thought and care has gone into the choice of variables and how they are combined to form derived variables.  As a result, the data has been stretched and folded in such a way that different classification values tend to be on different sides of the input space.

This manipulation of the inputs helps not only logistic regression but almost any technique.  Although the names are fancier and the computing power way more advanced, very powerful techniques rely on geometries studied 2,300 years ago in the ancient world.




Wednesday, February 26, 2014

Taking a Random Sample on Amazon Redshift

Recently, I was approached by Vicky, whom I'm working with at a client, to help with a particular problem.  She wanted to calculate page view summaries for a random sample of visitors from a table containing about a billion page views.  This is a common problem, especially as data gets larger and larger.  Note that the sample itself is based on visitors, so a simple random sample of page views is not sufficient.  We needed a sample of visitors and then all the pages for each visitor.

Along the way, I learned some interesting things about Redshift, taking random samples, and working with parallel and columnar databases.

For those not familiar with it, Redshift is an Amazon cloud data store that uses ParAccel, a parallel columnar database based on an older version of Postgres.  Postgres is a standard-enough relational database, used by several database vendors as the basis of their products.

Columnar databases have interesting performance characteristics, because the database stores each column separately from other columns.  Although generally bad performance-wise for ACID-compliant transactions (if you don't know what ACID is, then you don't need to know), columnar databases are good for analysis.

However, your intuition about how things work may not apply.  A seemingly simple query such as this:

select *
from PageViews
limit 10;

takes a relatively long time (several minutes) because all the columns have to be read independently.  On the other hand, a query such as:

select min(BrowserId), max(BrowserId)
from PageViews;

goes quite fast (a few seconds), because only one column has to be read into memory.  The more columns the query reads, the slower it is -- other things being equal.

Back to the random sample.  A typical way of getting this type of random sample is to first find the reduced set of visitors and then join them back to the full page views.   This sounds cumbersome, but the strategy actually works well on many databases.  Applied to the query we were working with, the resulting query looks something like:

select pv.BrowserId
from (select distinct BrowserId
      from PageViews
      order by random()
      limit 100000
     ) list join
     PageViews pv
     on list.BrowserId = pv.BrowserId
group by pv.BrowserId;

This is a reasonable and standard approach to reduce the processing overhead.  The subquery list produces all the BrowserIds and then sorts them randomly (courtesy of the random() function).  The limit clause then takes a sample of one hundred thousand (out of many tens of millions).  The join would normally use an indexed key, so it should go pretty fast.  On Redshift, the subquery to get list performs relatively well.  But the entire query did not finish (our queries time out after 15-30 minutes).  We experimented with several variations, to no avail.

What finally worked?  Well, a much simpler query and this surprised us.  The following returned in just a few minutes:

select BrowserId
from PageViews pv
group by BrowserId
order by random()
limit 100000;

In other words, doing the full aggregation on all the data and then doing the sorting is actually faster than trying to speed up the aggregation by working on a subset of the data.

I've been working with parallel databases for over twenty years.  I understand why this works better than trying to first reduce the size of the data.  Nevertheless, I am surprised.  My intuition about what works well in databases can be inverted when using parallel and columnar databases.

One of Vicky's requirements was for a repeatable random sample.  That means that we can get exactly the same sample when running the same query again.  The random() function does not provide the repeatability.  In theory, by setting the seed, it should.  In practice, this did not seem to work.  I suspect that aspects of load balancing in the parallel environment cause problems.

Fortunately, Postgres supports the md5() function.  This is a hash function that converts a perfectly readable string into a long string containing hexadecimal digits.  These digits have the property that two similar strings produce very different results, so this is a good way to randomize strings.  It is not perfect, because two BrowserIds could have the same hash value, so they would always be included or excluded together.  But, we don't need perfection; we are not trying to land a little Curiosity lander in a small landing zone on a planet tens of millions of miles away.

The final form of the query was essentially:

select BrowserId
from PageViews pv
group by BrowserId
order by md5('seed' || BrowserId)
limit 100000;

The constant "seed" allows us to get different, repeatable sample when necessary.  And Vicky can extract her sample in just a few minutes, whenever she wants to.

Sunday, October 21, 2012

Catch our Webcast on November 15

Gordon and I rarely find ourselves in the same city these days, but on November 15 we will be in Cary, North Carolina with our friends at JMP for a webcast with Anne Milley.  The format will be kind of like the first presidential debate with Anne as the moderator, and kind of like the second one with questions from you, the audience.  Sign up here.

Sunday, March 11, 2012

Measuring Site Engagement: Pages or Sessions

One of our clients is a large media website that faced a simple question: What is the best way to find the most engaged users on the web site? The goal was to focus a marketing effort on these users.

A media web site is challenging, because there is no simple definition of engagement or customer worth. The idea is that engagement can either lead to more advertising views or to longer subscriptions, depending on the business model for the site. On the other hand, for a retailing site, the question is simpler, because there is a simple measure of who the best customers are: the amount of money they spend.

Engagement is a nice marketing concept, but how can it be defined in the real world? One way is to simply look at the number of page views during some period of time. Another is to look at the number of sessions (or alternatively days of activity if sessions are not available) during a specified period of time. Yet another is to measure breadth of usage of the site over a period of time: Does the user only go to one page? Is the user only coming in on referrals from Google?

The first analysis used one month of data to define engagement. The top users for one month were determined based on pages and sessions. Of course, there is a lot of overlap between the two groups -- about 60% of the top deciles overlapped.

Which group seems better for defining engagement, the top users by page views or by sessions? To answer this, let's borrow an idea from survival analysis and measure how many users are still around nine months later. (Nine months is arbitrary in this case.) The return rate for the top decile by sessions was 74.4%, while the rate for the top decile by pages was lower, at 73.8%. Not a big difference, but one that suggests that sessions are better.

Actually, the results are even more striking for visitors who are not in both top deciles. For the non-overlapping group, the session return rate is 69.6% versus 67.9% for the page deciles.

For defining engagement, we then extended these results to three months instead of one to find the top one million most engaged users. The three measures are:

  1. Visitors that have the most page views over three months.
  2. Visitors that have the most sessions over three months.
  3. Visitors in the top tercile (third) of sessions in each month, then taking the highest terciles.

Three months was chosen as a rather arbitrary length of time, because the data was available. Holding it constant also lets us understand the difference between sessions and page views.

These three methods all produced about the same number of visitors -- the goal was to find the top one million most engaged users.

By these measures, the top one million visitors chosen by the three methods had the following "return" rates, nine months later:

  1. Page views in three months: 65.4%
  2. Sessions in three months: 65.9%
  3. Sessions over three months: 66.9%

The nine-month survival suggests that sessions over three months are the better approach for measuring engagement.

Tuesday, January 17, 2012

Writing to a text file from SQL Server

It has been a while since I've contributed to the blog . . . not because I've had nothing to say. In this time, I've been spending a lot of time working with SQL Server, producing useful stored procedures (and insights). In this post, I discuss one of them, a stored procedure in SQL Server to write text to a file.

This stored procedure is a utility. I learned a lot along the way while trying to write it. This post is intended to explain these learnings.

The approach that I'm taking is to use xp_cmdshell to write one line at a time using the DOS echo command. A different approach uses OLE automation and the File System Object. I couldn't get this to work, possibly because it requires configurations that I don't know about; possibly because I don't have the right permissions.

My stored procedure is called usp__AppendToFile and the code is at the end of this post. If you care about naming conventions, here is the reasoning behind the name. The "usp" prefix is for user stored procedure. Starting a stored procedure with usp or sp seems redundant to me, but appears to be a common and perhaps even a best practice. The double underscore is my convention, saying that this is a utility. It is then followed by a reasonable name.

usp__AppendToFile does the following: It takes a string (varchar(max)) and an optional end-of-line character. It then writes the string, one line at a time, using the echo command in DOS. By passing in the end of line character, the stored procedure can work with text that uses the DOS standard end of line (carriage return followed by line feed, the default) as well as other standards.

Although seemingly simple and using familiar tools, I learned several things from this effort.

My first lesson is that in order to write to a file, you need to be able to access it. When you run a command in SQL Server, it is not really "you" that needs permissions. The SQL Server service needs to be able to access the file. And this depends on the user running the service. To see this user, go to the Control Panel, choose the Administrative Tools, and select Services. Scroll down to find the SQL Server service (called something like SQL Server Agent), and look in the column Log On As.

As an example, the user running the service on one machine used a local machine account rather than a Windows-verified domain account. For this reason, SQL Server could not access files on the network. Changing the service to run under a Windows-authenticated account enabled SQL Server to create a file. (The alternative of changing the permissions for the user was not possible, since I do not have network sys admin privileges.)

The second lesson is that in order to write to a file using xp_cmdshell, you need to have xp_cmdshell enabled as shown here. There are good reasons why some DBAs strongly oppose enabling this option, since it does open up a security hole. Well, actually, the security hole is the fault of Microsoft, since the command is either enabled or disabled at the server level. What we really want is to give some users access to it, while denying others.

Third, the DOS way to write text to a file is using the echo command. Nothing is as simple as it seems. Echo does generally write text. However, it cannot write an empty line. Go ahead. Open a CMD shell, type in echo and see what happens. Then type in echo with a bunch of spaces and see what happens. What you get is the informative message: ECHO is on. Thanks a bunch, but that's not echoing what was on the command line.

I want my procedure to write blank lines when it finds them in the string. To fix this problem, use the echo. command. For whatever reason, having the period allows an empty line to be written. Apparently, other characters work as well, but period seems to be the accepted one.

The problems with DOS seem solved, but they are not. DOS has another issue: some special characters are interpreted by DOS, even before echo gets to them. For instance, > is interpreted as redirecting output to a file, | is interpreted as a pipe between commands, and & is interpreted as running a command in the background. Fortunately, these can be escaped using the DOS escape character, which I'm sure everyone knows is a caret (^).

But the issue does not end there, because special characters might appear inside a quoted string, in which case they do not need to be escaped. Parsing a string in a stored procedure to find quotes is beyond the scope of this stored procedure. Instead, if there are no double quotes in the string, then it escapes special characters. Otherwise, it does not.

Combining these lessons, here is what I consider to be a useful utility to write a string to a text file, even when the string consists of multiple lines.

CREATE procedure usp__AppendToFile (
    @str varchar(max),
    @FileName varchar(255),
    @EOL varchar(10) = NULL
) as
begin
    if @EOL is NULL
    begin
        set @EOL = char(13) + char(10);
    end;

    -- the period allows for empty lines
    declare @prefix varchar(255) = 'echo.';
    declare @suffix varchar(255) = '>>' + @FileName;

    -- Escape special characters so things work
    -- But escapes work funny when in double quotes (and maybe single quotes too)
    set @str = (case when charindex('"', @str) = 0
                     then replace(replace(replace(@str, '|', '^|'), '>', '^>'), '&', '^&')
                     else @str
                end);

    while (@str <> '')
    begin
        declare @pos int = charindex(@EOL, @str);
        declare @line varchar(8000) = (case when @pos > 0 then left(@str, @pos) else @str end);
        set @str = (case when @pos > 0 then substring(@str, @pos + 2, 1000000) else '' end);

        set @line = @prefix + @line + @suffix;

        -- write @line to file
        exec xp_cmdshell @line;
    end;
end; -- usp__AppendToFile
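
Here is a hypothetical call, just to show the calling convention (the file path is only an example):

declare @msg varchar(max) = 'first line' + char(13) + char(10) + 'second line';
exec usp__AppendToFile @str = @msg, @FileName = 'C:\temp\example.txt';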

Thursday, February 25, 2010

Agglomerative Variable Clustering

Lately, I've been thinking about the topic of reducing the number of variables, and how this is a lot like clustering variables (rather than clustering rows). This post is about a method that seems intuitive to me, although I haven't found any references to it. Perhaps a reader will point me to references and a formal name. The method uses Pearson correlation and principal components to agglomeratively cluster the variables.

Agglomerative clustering is the process of assigning records to clusters, starting with the records that are closest to each other. This process is repeated, until all records are placed into a single cluster. The advantage of agglomerative clustering is that it creates a structure for the records, and the user can see different numbers of clusters. Divisive clustering, such as implemented by SAS's varclus proc, produces something similar, but from the top-down.

Agglomerative variable clustering works the same way. Two variables are put into the same cluster, based on their proximity. The cluster then needs to be defined in some manner, by combining information in the cluster.

The natural measure for proximity is the square of the (Pearson) correlation between the variables. This is a value between 0 and 1, where 0 means totally uncorrelated and 1 means the values are collinear. For those who are more graphically inclined, this statistic has an easy interpretation when there are two variables. It is the R-square value of the first principal component of the scatter plot.

Combining two variables into a cluster requires creating a single variable to represent the cluster. The natural variable for this is the first principal component.

My proposed clustering method repeatedly does the following:
  1. Finds the two variables with the highest correlation.
  2. Calculates the principal component for these variables and adds it into the data.
  3. Maintains the information that the two variables have been combined.
The attached SAS code (available at sas-var-hierarchical-clustering-v01.sas) does exactly this, although not in the most efficient and robust way. The bulk of the code is a macro, called buildcolumns, that appends the new cluster variables to the data set and maintains another table called columns which has the information about the rows. After I run this code, I can select different numbers of variables using the expression:

proc sql;
    select colname
    from columns
    where counter <= [some number];
quit;

These variables can then be used for predictive models or visualization purposes.

The inner loop of the code works by doing the following:
  1. Calling proc corr to calculate the correlation of all variables not already in a cluster.
  2. Transposing the correlations into a table with three columns, two for the variables and one for the correlation using proc transpose.
  3. Finding the pair of variables with the largest correlation.
  4. Calculating the first principal component for these variables.
  5. Appending this principal component to the data set.
  6. Updating the columns data set with information about the new cluster.
The data set referred to in the code comes from the companion site for Data Analysis Using SQL and Excel. The code will fail (by running an infinite loop) if any variables are missing or if two variables are exactly correlated.

Wednesday, February 10, 2010

Creating DDL For An Entire Database In SQL Server 2008

Recently, I started a new project which has a database component. I looked around for some visual data modeling tools, and I settled on just using the diagrams capability of SQL Server. Since the client is using SQL Server, it was simple to download SQL Server Express and get started using their diagramming tool.

After creating a bunch of tables, I learned that SQL Server Database Diagrams do not produce the Data Definition Language (DDL) to create the database. Instead, the tables are created in sync with the diagram. Furthermore, SQL Server does not have a command that creates the DDL for an entire database. Right clicking on two dozen tables is cumbersome. But even worse, it would not provide complete DDL, since the table DDL does not include index definitions.

I have seen some debate on the web about the merits of graphical tools versus text DDL. Each has their advantages, and, personally, I believe that a decent database tool should allow users to switch between the two. The graphical environment lets me see the tables and their relationships. The text allows me to make global changes, such as:
  • Changing all the SMALLDATETIME data types to DATE when I go to a commercial version of SQL Server. The Express version does not support DATE, alas.
  • Adding auditing columns -- such as user, creation date, and update date -- to almost all tables.
  • Adding table-specific comments.
Doing these types of actions in a point-and-click environment is cumbersome, inefficient, and prone to error. At the same time, the GUI environment is great for designing the tables and visualizing their relationships.

So, I searched on the web for a DDL program that would allow me to create the DDL for an entire SQL Server database. Because I did not find any, I decided that I had to write something myself. The attached file, script-all-tables.sql, contains my script.

This script uses SQL to generate SQL code -- a trick that I talk about in my book Data Analysis Using SQL and Excel. The script generates code for the following:
  1. Dropping all tables in the database, if they exist.
  2. Creating new versions of the tables, taking into account primary keys, data types, and identity columns.
  3. Creating foreign key constraints on the table.
  4. Creating indexes on the table.
This is a very common subset of DDL used for databases. And, importantly, it seems to cover almost all that you can do using Database Diagrams. However, the list of what it is missing from fully re-creating any database is very, very long, ranging from user defined types, functions, and procedures, to the storage architecture, replication, and triggers.

The script uses the views in the sys schema rather than INFORMATION_SCHEMA, simply because I found it easier to find the information that I needed to put the SQL together.
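
To give the flavor of the SQL-generating-SQL trick -- this is not the script itself, just a sketch against the standard sys.tables and sys.schemas views -- here is a query that produces a DROP TABLE statement for every user table:

select 'if object_id(''' + s.name + '.' + t.name + ''') is not null drop table ' +
       quotename(s.name) + '.' + quotename(t.name) + ';'
from sys.tables t join
     sys.schemas s
     on t.schema_id = s.schema_id
order by s.name, t.name;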

Tuesday, February 2, 2010

Simpson's Paradox and Marketing

A reader asked the following question:

Hi Michael/Gordon,
In campaign measurements, it's possible to get a larger lift at the overall level compared to all the individual decile level lifts or vice versa, because of the differences in sample size across the deciles, and across Test & Control.
According to wikipedia, it's known as Simpson's paradox (or the Yule-Simpson effect) and is explained as an apparent paradox in which the successes in different groups seem to be reversed when the groups are combined.
In such scenarios, how do you calculate the overall lift? Which methods are commonly used in the industry?
Thanks,
Datalligence
http://datalligence.blogspot.com/

Simpson's Paradox is an interesting phenomenon, where results about subgroups of a population do not generalize to the overall population. I think the simplest version that I've heard is an old joke . . . "I heard you moved from Minnesota to Iowa, raising the IQ of both states."

How could this happen? For the joke to work, the average IQ in Minnesota must be higher than the average IQ in Iowa. And, the person who moves must have an IQ between these two values. Voila, you get the paradox that the averages in both states go up, even though the overall population is exactly the same.

I didn't realize that this paradox has a name (or, if I did, then I had forgotten). Wikipedia has a very good article on Simpson's Paradox, which includes real world examples from baseball, medical studies, and an interesting discussion of a gender discrimination lawsuit at Berkeley. In the gender discrimination lawsuit, women were accepted at a much lower rate than men overall. However, department by department, women were typically accepted at a higher rate than men. The difference is that women applied to more competitive departments than men. These departments have lower rates of acceptance, lowering the overall rate for women.

Simpson's Paradox arises when we are taking weighted averages of evidence from different groups. Different weightings can produce very different, even counter-intuitive results. The results become much less paradoxical when we see the actual counts rather than just the percentages.
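
A small made-up example shows how the counts drive the reversal (the numbers are purely illustrative):
  • Department A: men 80 of 100 accepted (80%); women 180 of 200 accepted (90%).
  • Department B: men 10 of 100 accepted (10%); women 100 of 800 accepted (12.5%).
  • Overall: men 90 of 200 accepted (45%); women 280 of 1,000 accepted (28%).
Women do better in each department yet worse overall, simply because most of the women applied to the department that accepts very few applicants.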

The specific question is how to relate this paradox to lift, and understanding marketing campaigns. Assume there is a marketing campaign, where one group receives a particular treatment and another group does not. The ratio of performance between these two groups is the lift of the marketing campaign.

To avoid Simpson's paradox, you need to ensure that the groups are as similar as possible, except for what's being tested. If the test is for the marketing message, there is no problem, both groups can be pulled from the same population. If, instead, the test is for the marketing group itself (say high value customers), then Simpson's Paradox is not an issue, since we care about how the group performs rather than how the entire population performs.

As a final comment, I could imagine finding marketing results where Simpson's Paradox has surfaced, because the original groups were not well chosen. Simpson's Paradox arises because the sizes of the test groups are not proportional to their sizes in the overall population. In this case, I would be tempted to weight the results from each group based on the expected size in the overall population to calculate the overall response and lift.

Saturday, January 9, 2010

Hadoop and Parallel Dataflow Programming

Over the past three months, I have been teaching myself enough Hadoop to get comfortable with using the environment for analytic purposes.

There has been a lot of commentary about Hadoop/MapReduce versus relational databases (such as the articles referenced in my previous post on the subject). I actually think this discussion is misplaced because comparing open-source software with commercial software aligns people on "religious" grounds. Some people will like anything that is open-source. Some people will attack anything that is open-source (especially people who work for commercial software vendors). And, the merits of real differences get lost. Both Hadoop and relational databases are powerful systems for analyzing data, and each has its own distinct set of advantages and disadvantages.

Instead, I think that Hadoop should be compared to a parallel dataflow style of programming. What is a dataflow style of programming? It is a style where we watch the data flow through different operations, forking and combining along the way, to achieve the desired goal. Not only is a dataflow a good way to understand relational databases (which is why I introduce it in Chapter 1 of Data Analysis Using SQL and Excel), but the underlying engines that run SQL queries are dataflow engines.

Parallel dataflows extend dataflow processing to grid computing. To my knowledge, the first commercial tool that implements parallel dataflows was developed by Ab Initio. This company was a spin-off from a bleeding edge parallel supercomputer vendor called Thinking Machines that went bankrupt in 1994. As a matter of full disclosure: Ab Initio was actually formed from the group that I worked for at Thinking Machines. Although they are very, very, very resistant to sharing information about their technology, I am rather familiar with it. I believe that the only publicly available information about them (including screen shots) is published in our book Mastering Data Mining: The Art and Science of Customer Relationship Management.

I am confident that Apache has at least one dataflow project, since when I google "dataflow apache" I get a pointer to the Dapper project. My wish, however, is that Hadoop were the parallel dataflow project.

Much of what Hadoop does goes unheralded by the typical MapReduce user. On a massively parallel system, Hadoop keeps track of the different parts of an HDFS file and, when the file is being used for processing, Hadoop does its darndest to keep the processing local to each file part being processed. This is great, since data locality is key to achieving good performance.

Hadoop also keeps track of which processors and disk systems are working. When there is a failure, Hadoop tries again, insulating the user from sporadic hardware faults.

Hadoop also does a pretty good job of shuffling data around, between the map and reduce operations. The shuffling method -- sort, send, and sort again -- may not be the most efficient, but it is quite general.

Alas, there are several things that Hadoop does not do, at least when accessed through the MapReduce interface. Supporting these features would allow it to move beyond the MapReduce paradigm, giving it the power to support more general parallel dataflow constructs.

The first thing that bothers me about Hadoop is that I cannot easily take a text file and just copy it with the Map/Reduce primitives. Copying a file seems like something that should be easy. The problem is that a key gets generated during the map processing. The original data gets output with a key prepended, unless I do a lot of work to parse out the first field and use it as a key.

Could the context.write() function be overloaded with a version that does not output a key? Perhaps this would only be possible in the reduce phase, since I understand the importance of the key for going from map to reduce.

A performance issue with Hadoop is the shuffle phase between the map and the reduce. As I mentioned earlier, the sort-send-sort process is quite general. Alas, though, it requires a lot of work. An alternative that often works well is simply hashing. To maintain the semantics of map-reduce, I think this would be hash-send-combine or hash-send-sort. The beauty of using hashing is that the data can be sent to its destination while the map is still processing it. This allows concurrent use of the processing and network during this operation.

And, speaking of performance, why does the key have to go before the data? Why can't I just point to a sequence of bytes and use that for the key? This would enable a programming style that doesn't spend so much time parsing keys and duplicating information between values and keys.

Perhaps the most frustrating aspect of Hadoop is the MapReduce framework itself. The current version allows processing like (M+)(R)(M*). What this notation means is that the processing starts with one or more map jobs, goes to a reduce, and continues with zero or more map jobs.

THIS IS NOT GENERAL ENOUGH! I would like to have an arbitrary number of maps and reduces connected however I like. So, one map could feed two different reduces, each having different keys. At the same time, one of the reduces could feed another reduce without having to go through an intermediate map phase.

This would be a big step toward parallel dataflow programming, since Map and Reduce are two very powerful primitives for this purpose.

There are some other primitives that might be useful. One would be broadcast. This would take the output from one processing node during one phase and send it to all the other nodes (in the next phase). Let's just say that using broadcast, it would be much easier to send variables around for processing. No more defining weird variables using "set" in the main program, and then parsing them in setup() functions. No more setting up temporary storage space, shared by all the processors. No more using HDFS to store small serial files, local to only one node. Just send data through a broadcast, and it goes everywhere. (If the broadcast is running on more than one node, then the results would be concatenated together, everywhere.)

And, if I had a broadcast, then my two-pass row number code (here) would only require one pass.

I think Hadoop already supports having multiple different input files feed into one reduce operator. This is quite powerful, and a much better way of handling join processing.

It would also be nice to have a final sort operator. In the real world, people often do want sorted results.

In conclusion, parallel dataflows are a very powerful, expressive, and efficient way of implementing complex data processing tasks. Relational databases use dataflow engines for their processing. With a non-procedural language such as SQL, the power of dataflows is hidden from the user -- and some relatively simple dataflow constructs can be quite difficult to express in SQL.

Hadoop is a powerful system that emulates parallel dataflow programming. Any step in a dataflow can be implemented using a MapReduce pass -- but this requires reading, writing, sorting, and sending the data multiple times. With a few more features, Hadoop could efficiently implement parallel dataflows. I feel this would be a big boost to both performance and utility, and it would leverage the power already provided by the Hadoop framework.

Tuesday, January 5, 2010

MapReduce versus Relational Databases?

The current issue of Communications of the ACM has articles on MapReduce and relational databases. One, MapReduce: A Flexible Data Processing Tool, written by two Google fellows, explains the utility of MapReduce -- appropriate authors, since Google invented the parallel MapReduce paradigm.

The second article, MapReduce and Parallel DBMSs: Friends or Foes?, is written by a team of authors, with Michael Stonebraker listed as the first author. I am uncomfortable with this article because it purports to show the superiority of a particular database system, Vertica, without mentioning -- anywhere -- that Michael Stonebraker is listed as the CTO and Co-Founder on Vertica's web site. For this reason, I believe that this article should be subject to much more scrutiny.

Before starting, let me state that I personally have no major relationships with any of the database vendors or with companies in the Hadoop/MapReduce space. I am an advocate of using relational databases for data analysis and have written a book called Data Analysis Using SQL and Excel. And, over the past three months, I have been learning Hadoop and MapReduce, as attested to by numerous blog postings on the subject. Perhaps because I am a graduate of MIT ('85), I am upset that Michael Stonebraker uses his MIT affiliation for this article, without mentioning his Vertica affiliation.

The first thing I notice about the article is the number of references to Vertica. In the main text, I count nine references to Vertica, as compared to thirteen mentions of other databases:
  • Aster (twice)
  • DataAllegro (once)
  • DB2 (twice)
  • Greenplum (twice)
  • Netezza (once)
  • ParAccel (once)
  • PostgreSQL (once)
  • SQL Server (once)
  • Teradata (once)
The paper describes a study that compares Vertica, another database, and Hadoop on various tasks. The paper never explains how these databases were chosen for this purpose. Configuration issues for the other database and for Hadoop are mentioned; from the absence of any mention of problems, one assumes that the configuration and installation of Vertica were easy and smooth. I have not (yet) read the paper cited, which describes the work in more detail.

Also, the paper never describes costs for the different systems, even though cost is a primary driver of MapReduce adoption: the software is free and runs on cheap clusters of computers, rather than on expensive servers and hardware. For a given amount of money, MapReduce may provide a much faster solution, since it can support much larger hardware environments.

The paper never describes issues in loading the data. I assume this is a significant cost for the databases. Loading the data for Hadoop is much simpler . . . since it just reads text files, a common format.

From what I can gather, the database systems were optimized specifically for the tasks at hand, although this is not explicitly mentioned anywhere. For instance, the second task is a GROUP BY, and I suspect that the data is hash partitioned by the GROUP BY clause.

There are a few statements that I basically disagree with.

"Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL." The issue here at first seems like a technicality. In a relational database, an input row can only into one group. MR can output multiple records in the map stage, so a single row can go into multiple "groups". This functionality is important for the word count example, which is the canonical MapReduce example. I find it interesting that this example is not included in the benchmark.

"Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL)." This is not true in several respects. First, MapReduce does have associated projects for supporting declarative languages. Second, in order for SQL to support the level of functionality that the authors claim, they need to use user defined functions. Is that syntax declarative?

More importantly, though, the computing model really is not exactly the same. Well, with SQL extensions such as GROUPING SETS and window functions, the functionality does come close. But consider the ways that you can add a row number to data (assuming that you have no built-in row number function) using MapReduce versus traditional SQL. Using MapReduce, you can follow the two-phase program that I described in an earlier posting. With traditional SQL, you have to do a non-equi self join. MapReduce has a much richer set of built-in functions and capabilities, simply because it uses Java, an established programming language with many libraries.

On the other hand, MapReduce does not have a concept of "null" built-in (although users can define their own data types and semantics). And, MapReduce handles non-equijoins poorly, because the key is used to direct both tables to the same node. In effect, you have to limit the MapReduce job to one node. SQL can still parallelize such queries.

"[MapReduce] still requires user code to parse the value portion of the record if it contains multiple attributes." Well, parse is the wrong term, since a Writable class supports binary representations of data types. I describe how to create such types here.

I don't actually feel qualified to comment on many of the operational aspects of optimizing Hadoop code. I do note that the authors do not explain the main benefit of Vertica, which is its support for column partitioning. Each column is stored separately, which makes it possible to apply very strong compression algorithms to the data. In many cases, the Vertica data will fit in memory. This is a huge performance boost (and one that another vendor, ParAccel, takes advantage of).

In the end, the benchmark may be comparing the in-memory performance of a database to general performance for MapReduce. The benchmark may not be including the ETL time for loading the data, partitioning data, and building indexes. The benchmark may not have allocated optimal numbers of map and reduce jobs for the purpose. And, it is possible that the benchmark is unbiased and relational databases really are better.

A paper that leaves out the affiliations between its authors and the vendors used for a benchmark is only going to invite suspicion.

Saturday, January 2, 2010

Hadoop and MapReduce: Normalizing Data Structures

In setting out to learn Hadoop and Map/Reduce, I tackled several different problems. The last of these problems is the challenge of normalizing data, a concept from the world of relational databases. The earlier problems were adding sequential row numbers and characterizing values in the data.

This posting describes data normalization, explains how I accomplished it in Hadoop/MapReduce, and points out some tricks in the code. I should emphasize here that the code is really "demonstration" code, meaning that I have not worked hard to be sure that it always works. My purpose is to demonstrate the idea of using Hadoop to do normalization, rather than to produce 100% working code.


What is Normalization and Why Do We Want To Do It?

Data normalization is the process of extracting values from a single column and placing them in a reference table. The data used by Hadoop is typically unnormalized, meaning that data used in processing is in a single record, so there is no need to join in reference tables. In fact, doing a join is not obvious using the MapReduce primitives, although my understanding is that Hive and Pig -- two higher level languages based on MapReduce -- do incorporate this functionality.

Why would we want to normalize data? (This is a good place to plug my book Data Analysis Using SQL and Excel, which explains this concept in more detail in the first chapter.) In the relational world, the reason is something called "relational integrity", meaning that any particular value is stored in one, and only one, place. For instance, if the state of California were to change its name, we would not want to update every record from California. Instead, we'd rather go to the reference table and just change the name there; the data field contains a state id rather than the state name itself. Relational integrity is particularly important when data is being updated.

Why would we want to normalize data used by Hadoop? There are two reasons. The first is that we may be using Hadoop processing to load a relational database -- one that is already designed with appropriate reference tables. This is entirely reasonable: relational databases are an attractive way to "publish" results from complex data processing, since they are better for creating end-user reports and building interactive GUI interfaces.

The second reason is performance. Extracting long strings and putting them in a separate reference table can significantly reduce the storage requirements for the data files. By far, most of the space taken up in typical log files, for instance, consists of long URIs (what I used to call URLs). When processing the log files, we might want to extract some features from the URIs, but keeping the entire string just occupies a lot of space -- even in a compressed file.


The Process of Normalizing Data

Normalizing data starts with data structures. The input records are assumed to be in a delimited format, with the column names in the first row (or provided separately, although I haven't tested that portion of the code yet). In addition, there is a "master" id file that contains the following columns:
  • id -- a unique id for every value by column.
  • column name -- the name of the column.
  • value -- the value in the column.
  • count -- the total number of times the value has so far occurred.
This is a rudimentary reference file. I could imagine having more summary information than just the count -- perhaps the first and last date when the value occurs, for instance.

What happens when we normalize data? Basically, we look through the data file to find new values in each column being normalized. We append these new values into the master id file, and then go back to the original data and replace the values with the ids.

Hadoop is a good platform for this for several reasons. First, because the data is often stored as text files, the values and the ids have the same type -- text strings. This means that the file structures remain the same. Second, Hadoop can process multiple columns at the same time. Third, Hadoop can use inexpensive clusters and free software for this task, rather than relying on databases and tools, which are often more expensive.

How To Normalize Data Using Hadoop/MapReduce

The normalization process has six steps. Most of these correspond to a single Map-Reduce pass.

Step 1: Extract the Column-Value Pairs from the Original Data

This step explodes the data, by creating a new data set with multiple rows for each row in the original data. Each output row contains a column, a value, and the number of times the value appears in the data. Only columns being normalized are included in the output.

This step also saves the column names for the data file in a temporary file. I'll return to why this is needed in Step 6.
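The code linked at the end of this post does the real work; purely as a simplified sketch (tab-delimited input, all columns normalized, and a hypothetical "normalize.columns" parameter carrying the column names instead of GenericRecord metadata), the map/reduce pair for this step might look like:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of Step 1: explode each row into (column:value, 1) pairs and sum the counts.
// The real code uses GenericRecord and handles the header row; this sketch assumes the
// driver put the column names into the Configuration under a hypothetical parameter.
public class ExplodeColumnValues {
    public static class ExplodeMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private String[] columnNames;

        @Override
        protected void setup(Context context) {
            // Hypothetical parameter listing the file's column names, in order.
            columnNames = context.getConfiguration()
                                 .get("normalize.columns", "").split(",");
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] values = line.toString().split("\t");
            for (int i = 0; i < columnNames.length && i < values.length; i++) {
                // key = "<column>:<value>", value = one occurrence
                context.write(new Text(columnNames[i] + ":" + values[i]), ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text columnValue, Iterable<LongWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            context.write(columnValue, new LongWritable(total));
        }
    }
}
```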

Step 2: Extract Column-Value Pairs Not in the Master ID File

This step compares the column-value pairs produced in the first step with those in the master id file. This step is interesting, because it reads data from two different data source formats -- the master id file and the results from Step 1. Both sets of data files use the GenericRecord format.

To identify the master file, the map function looks at the original data to see whether "/master" appears in the path. Alternative methods would be to look at the GenericRecord that is created or to use MultipleInputs (which I didn't use because of a warning on Cloudera's web site).
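A minimal sketch of that path test, under the simplifying assumption of plain text input rather than GenericRecords (the "/master" substring is just my naming convention):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: one map class reading two kinds of input, telling them apart by path.
public class TwoSourceMapper extends Mapper<LongWritable, Text, Text, Text> {
    private boolean fromMaster;

    @Override
    protected void setup(Context context) {
        // Inspect the path of the split feeding this map task.
        FileSplit split = (FileSplit) context.getInputSplit();
        fromMaster = split.getPath().toString().contains("/master");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Tag each record with its source so the reducer can treat them differently.
        context.write(new Text(line.toString().split("\t", 2)[0]),
                      new Text((fromMaster ? "master" : "new") + "\t" + line));
    }
}
```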


Step 3: Calculate the Maximum ID for Each Column in the Master File

This is a very simple Map-Reduce step that simply gets the maximum id for each column. New ids will be assigned starting at one more than this value.
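A sketch of that reduce, assuming the map emits the column name as the key and each id as a LongWritable value:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of Step 3's reduce: one output row per column with the maximum id seen.
public class MaxIdReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text column, Iterable<LongWritable> ids, Context context)
            throws IOException, InterruptedException {
        long maxId = Long.MIN_VALUE;
        for (LongWritable id : ids) {
            maxId = Math.max(maxId, id.get());
        }
        context.write(column, new LongWritable(maxId));
    }
}
```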

This is an instance where I would very much like to have two different reduces following a map step. If this were possible, then I could combine this step with step 2.


Step 4: Calculate a New ID for the Unmatched Values

This is a two-step process that follows the mechanism for adding row numbers discussed in one of my earlier posts, with one small modification. The final result has the maximum id value from Step 3 added to it, so the result is a new id rather than just a row number.


Step 5: Merge the New Ids with the Existing Master IDs

This step merges the results from Step 4 with the existing master id file. Currently, the results are placed into another directory. Eventually, they could simply overwrite the master id file.

Because of the structure of the Hadoop file system, the merge could be as simple as copying the file with the new ids into the appropriate master id data space. However, this would result in an unbalanced master id file, which is probably not desirable for longer term processing.


Step 6: Replace the Values in the Original Data with IDs

This final step replaces the values with ids -- the actual normalization step. It is a two-part process. The map phase of the first part takes both the original data and the master id file. All the column-value pairs are exploded from the original data, as in Step 1, with the output consisting of:
  • key: <column name>:<value>
  • value: <"expect"|"nomaster">, <record id (partition id + row number)>, <column number>
The first part ("expect" or "nomaster") is an indicator of whether this column should be normalized (that is, whether or not to expect a master id). The second field identifies the original data record, which is uniquely identified by the partition id and row number within that partition. The third is the column number in the row.

The master records are placed in the format:
  • key: <column name>:<value>
  • value: "master", <id>
The reduce then reads through all the records for a given column-value combination. If one of them is a master, then it outputs the id for all records. Otherwise, it outputs the original value.
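As a simplified sketch of that reduce (plain Text records instead of GenericRecords, and assuming the map tags each value as "master" or "expect"/"nomaster" followed by a tab and the rest of the fields):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the Step 6 reduce: for one "<column>:<value>" key, emit the master id
// if a "master" record is present, otherwise pass the original value through.
public class ReplaceValueReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text columnValue, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        String masterId = null;
        List<String> dataRecords = new ArrayList<String>();

        // Collect the data records until we know whether a master id exists.
        for (Text record : records) {
            String[] fields = record.toString().split("\t", 2);
            if ("master".equals(fields[0])) {
                masterId = fields[1];        // the id assigned to this value
            } else {
                dataRecords.add(fields[1]);  // record id, column number, etc.
            }
        }

        String value = columnValue.toString();
        String output = (masterId != null)
            ? masterId                                       // normalized: use the id
            : value.substring(value.indexOf(':') + 1);       // no master: keep the value
        for (String rec : dataRecords) {
            context.write(new Text(rec), new Text(output));
        }
    }
}
```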

The last phase simply puts the records back together again, from their exploded form. The one trick here is that the metadata is read from a local file.


Tricks Used In This Code

The code is available in these files: Normalize.java, GenericRecordInputFormat.java, GenericRecord.java, and GenericRecordMetadata.java. This code uses several tricks along the way.

One trick that I use in Step 4, for the phase 1 map, makes the code more efficient. This phase of the computation extracts the maximum row number for each column. Instead of passing all the row numbers to a combine or reduce function, it saves them in a local hash-map data structure. I then use the cleanup() routine in the map function to output the maximum values.
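Here is a rough sketch of that pattern; the input key and value types are assumptions, since the point is just the local hash map plus the cleanup() output:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the in-mapper aggregation trick: keep running maximums in a hash map
// and emit only one record per column from cleanup(), instead of one per input row.
public class MaxRowNumberMapper extends Mapper<Text, LongWritable, Text, LongWritable> {
    private final Map<String, Long> maxByColumn = new HashMap<String, Long>();

    @Override
    protected void map(Text column, LongWritable rowNumber, Context context) {
        Long current = maxByColumn.get(column.toString());
        long value = rowNumber.get();
        if (current == null || value > current) {
            maxByColumn.put(column.toString(), value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // One output record per column for the whole map task.
        for (Map.Entry<String, Long> entry : maxByColumn.entrySet()) {
            context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
        }
    }
}
```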

Often the master code needs to pass variables to the map/reduce jobs. The best way to accomplish this is by using the "set" mechanism in the Configuration object. This allows variables to be assigned a string name. The names of all the variables that I use are stored in constants that start with PARAMETER_, defined at the beginning of the Normalize class.

In some cases, I need to pass in arrays, for instance, when passing in the list of columns to be normalized. In this case, one variable gives the number of values ("normalize.usecolumns.numvals"). Then each value is stored in a variable such as "normalize.usecolumns.0", "normalize.usecolumns.1", and so on.
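A sketch of both mechanisms together -- the single "set" variable and the numbered array. The "normalize.usecolumns" names are the ones described above; the other parameter name is illustrative rather than the exact one in Normalize.java:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: passing a scalar and a list from the driver to map/reduce tasks
// through the Configuration object.
public class ParameterExample {
    public static void setParameters(Configuration conf, String[] columns) {
        conf.set("normalize.masterfile", "/master/ids");       // single value (name illustrative)
        conf.setInt("normalize.usecolumns.numvals", columns.length);
        for (int i = 0; i < columns.length; i++) {
            conf.set("normalize.usecolumns." + i, columns[i]);  // one entry per value
        }
    }

    // In a Mapper/Reducer setup(), read them back from context.getConfiguration():
    public static String[] getColumns(Configuration conf) {
        int numVals = conf.getInt("normalize.usecolumns.numvals", 0);
        String[] columns = new String[numVals];
        for (int i = 0; i < numVals; i++) {
            columns[i] = conf.get("normalize.usecolumns." + i);
        }
        return columns;
    }
}
```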

Some of the important processing actually takes place in the master loop, where results are gathered and then passed to subsequent steps using this environment mechanism.

The idea behind the GenericRecord class is pretty powerful, with the column names at the top of the file. GenericRecords make it possible to read multiple types of input in the same map class, for instance, which is critical functionality for combining data from two different input streams.

However, the Map-Reduce framework does not really recognize these column names as being different, once generic records are placed in a sequence file. The metadata has to be passed somehow.

When the code itself generates the metadata, this is simple enough. A function is used to create the metadata, and this function is used in both the map and reduce phases.

A bigger problem arises with the original data. In particular, Step 6 of the above framework re-creates the original records, but it has lost the column names, which poses a conundrum. The solution is to save the original metadata in Step 1, which first reads the records. This metadata is then passed into Step 6.

In this code, this is handled by simply using a file. The first map partition of Step 1 writes this file (this partition is used to guarantee that the file is written exactly once). The last reduce in Step 6 then reads this file.

This mechanism works, but is not actually the preferred mechanism, because all the reduce tasks in Step 6 are competing to read the same file -- a bottleneck.

A better mechanism is for the master program to read the file and to place the contents in variables in the jar file passed to the map reduce tasks. Although I do this for other variables, I don't bother to do this for the file.