
Tuesday, August 26, 2014

An Achievement on Stack Overflow

Last Friday (August 22nd), I achieved a milestone on Stack Overflow, a computing question-and-answer site, by hitting 200,000 points.  So far this year, I have also earned more points than anyone else, a testament to my obsession with the site (you can see yearly user rankings here).  As for overall points, I will never match the leader, Jon Skeet, who had a head start of many years.

My answers are almost exclusively related to SQL and various databases.   The site is heavily geared toward "tools" questions, so there are few general analysis questions.

So, this post shares some of my thoughts and my history on the site.

Clearly, I have helped a lot of people on Stack Overflow, around the world.  The most rewarding part is the thank-yous from real people working on real problems.  On several occasions, I have helped speed up code by more than 99% -- turning hours of drudgery into seconds or minutes of response time.

But answering questions has helped me too:
  • My technical knowledge of databases has greatly improved, particularly the peculiarities (strengths and weaknesses) of each database engine.
  • I have learned patience for people who are confused by concepts in SQL.
  • I have (hopefully) learned how to explain concepts to people with different levels of competence.
  • I have learned how to figure out (sometimes) what someone is really asking.
  • I have a strong appreciation for what SQL can do and what SQL cannot do.
  • It has definitely increased the number of hits when I egosurf.
A few months after starting, I stopped down voting questions and answers.  "Down voting" is usually seen as "not-nice", making the other person defensive, confused, and perhaps angry.   A lesson for real life:  being mean (by down voting) is not nearly so useful as offering constructive criticism (by commenting).


This all started in January 2012 (a bit over two and a half years ago).  The reason was simple:  I was writing a system called The Netting Machine for the Lehman Brothers Estate, and it was stretching my knowledge of SQL Server.  One particular problem involved dynamic queries.  Google kept directing me to the same question on Stack Overflow.  The best answer was close to what I needed, but not quite -- only half-way there.  The third time I landed on the page, I added my own answer.  This was actually for quite selfish reasons:  the next time Google took me there, I wanted to see the full answer.

Lo and behold, my answer was accepted and up voted.  When the OP ("original poster" -- Stack Overflow lingo for the person asking the question) accepted my answer, s/he had to unaccept another.  That answer was by Aaron Bertrand, a SQL Server guru whose name I recognized from his instructive blog posts.  Aaron commented about the "unaccept".  In the back of my mind, I thought: if Aaron thinks this is important, then there must be something to it.  Ultimately, I can blame Aaron (whom I have not yet met in person) for getting me hooked.

For a few months, I sporadically answered questions.  Then, in the first week of May, my Mom's younger brother passed away.  That meant lots of time hanging around family, planning the funeral, and the like.  Answering questions on Stack Overflow turned out to be a good way to get away from things.  So, I became more intent.

Stack Overflow draws you in not only with points but also with badges and privileges.  Each time I logged in, the system "thanked" me for my participation with more points, more badges, and more privileges.   This continued.  One day (probably in June), I hit the daily maximum of 200 points from upvotes (you also get points when an answer is accepted or someone offers a bounty).  One week, I hit 1,000 points.  One month, 5,000 points.  As an individual who is mesmerized by numbers, I noticed these things.

Last summer, I hit 100,000 points in September and slowed down.  I figured that six figures was enough, and I had other, more interesting things to do -- a trip to Myanmar, my sister's wedding, our apartment in Miami, classes to teach (San Francisco, Amsterdam, Chicago, San Antonio, Orlando) and so on.

I didn't start 2014 with the intention of spending too much of my time on the site.  But three things happened in January.  The first was a fever with a rash on my face.  It kept me home with not enough to do.  So, I answered questions on Stack Overflow.  Then, I had an attack of gout.  That kept me home with not enough to do.  And finally, the weather in January in New York was, well, wintery -- lots of cold and lots of snow.  More reasons to stay home and answer questions.

By the end of January, I was the top scorer for the month.  "Hey, if I can do it in January, let's see what happens in February."  I had thought of relenting in April:   we flew to Greece and spent two nights on Mount Athos.  Mount Athos is a peninsula in northern Greece, devoted to twenty-one Orthodox monasteries -- and nothing else.  It is inhabited by a few thousand monks living medieval lifestyles.  The only way to visit is as a "pilgrim", staying at a monastery.  An incredible experience.  No internet.  But, I was able to make up the point deficit on Stack Overflow.

This year, each month that passes is another month where I seem to be the top point-gatherer on Stack Overflow.  At this point, I might as well make it to the end of the year.  I don't know if I will, but I do hope to help a few other people and to learn more about databases and how people are using them.




Tuesday, March 25, 2014

Three SQL Constructs You Can Forget About

SQL is a very powerful language, which could, of course, be made even more powerful and useful.  This post discusses three features of the language -- ANSI standard features -- that seem not only unnecessary but downright detrimental.  That is, they seem to cause much more confusion than they provide in functionality.  And, in all these cases, it would be easy to work around their absence.

Although it would be nice to remove these from the language itself, that is unlikely to happen.  However, they can be de-prioritized for people learning SQL.  These constructs are easy to work around and are less functional than their alternatives.  When learning SQL, these should be learned later in the process.

(1)  INSERT . . . VALUES()

The first construct is the use of VALUES with INSERT, as in:

insert into t(col1)    values(1);

In almost every database, this is easily replaced with:

insert into t(col1)
    select 1;
In some databases, you might have to add a from dual or from sys.dummy to make this work.
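For example, a minimal Oracle-style sketch of the same statement (using the same placeholder table t as above):

insert into t(col1)
    select 1 from dual;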
And, in every respect except one, the INSERT . . . SELECT method is better.   For instance, you can add a WHERE clause to be sure that the value doesn't already exist:
insert into t(col1) select 1 where not exists (select 1 from t t2 where t2.col1 = 1);

Or, you can readily add other values, from this or another table:
insert into t(col1, col2)    select 1, (select count(*) from t2)
Trying to fit this into a VALUES statement just causes syntax errors and confusion.

And, you can use UNION ALL to add multiple rows at the same time.
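For example, a minimal sketch of inserting several rows in one statement (same placeholder table t; some databases would also want a from dual on each branch):

insert into t(col1)
    select 1 union all
    select 2 union all
    select 3;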

The VALUES syntax has exactly one advantage: it is standard, so the same code works in multiple databases.  That seems very minor.  It would be better if the standard had a way of using SELECT to return a row without referencing a table.


(2)  SELECT DISTINCT

The next unnecessary construct is SELECT DISTINCT.   First, this is easily replaced with GROUP BY.  So:
select distinct a, b, c from t;
is the same as:
select a, b, c from t group by a, b, c;

What makes the GROUP BY better?   Primarily the fact that you can have a HAVING clause.

One problem is that SELECT DISTINCT is sometimes understood to mean "get me all the rows that are distinct" rather than "get me the distinct values from all the rows."  That first interpretation makes a lot of sense, even though it is wrong.  Not only is there no such confusion with GROUP BY, but adding HAVING COUNT(*) = 1 actually answers the first question.  There is no way to do that with SELECT DISTINCT.
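As a sketch of that first interpretation, this returns only the combinations of a, b, and c that appear on exactly one row:

select a, b, c
from t
group by a, b, c
having count(*) = 1;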

The second problem is perhaps more dangerous.  Have you ever seen someone write this?
select distinct(a), b, c from t;
Here, the DISTINCT seems to be used like a function.   The intention is "Get me distinct values of a along with arbitrary values of b and c".  Of course, this is exactly the same with or without the parentheses.  DISTINCT is not a function.  This usage is so prevalent that Postgres introduced the DISTINCT ON syntax to support it.
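For reference, a minimal sketch of that intention using the Postgres extension (the ORDER BY determines which row supplies b and c for each value of a):

select distinct on (a) a, b, c
from t
order by a, b;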

What advantages does SELECT DISTINCT have?  The syntax is shorter and you don't have to repeat the column names in a GROUP BY clause.    In a world of cut-and-paste, copying the columns to the GROUP BY clause is negligible effort.   And, it does allow SELECT DISTINCT *.   However, that is a construct that I wouldn't miss at all.


(3)  COUNT(column)

Finally, there is the COUNT aggregation function with a column as an argument.  Just to be clear, I have no problem with COUNT(DISTINCT column) or COUNT(*) or COUNT(1).

No doubt, the designers of SQL were obsessed with NULL values (and despite the obsession, they still didn't get it right).   Wouldn't everyone in the world (who uses SQL) want to count the number of non-NULL values in a column?  What else could COUNT(column) mean?

Well, in many contexts, people probably think it means COUNT(DISTINCT column).  Consider the following query:
select c.country, count(c.CustomerId), count(o.OrderId)
from Customers c join
     Orders o
     on c.CustomerId = o.CustomerId
group by c.country;

Many people might write this code, just like this, with the intention of getting the number of customers and the number of orders in each country.  How sad when they learn that these are the same!  There are no repeat purchasers anywhere.  (COUNT(DISTINCT c.CustomerId) fixes this problem.)
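Here is a sketch of a corrected version, with the distinct counts made explicit (same hypothetical Customers and Orders tables):

select c.country, count(distinct c.CustomerId) as NumCustomers,
       count(distinct o.OrderId) as NumOrders
from Customers c join
     Orders o
     on c.CustomerId = o.CustomerId
group by c.country;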

Without COUNT(column) in the language, such confusion would be a non-issue.

And, if you wanted to count non-NULL values?  Why not do it explicitly, so you can remember what the query is supposed to be doing:
select sum(case when a is not null then 1 else 0 end)
Yes, this takes a bit more typing, but the query is much clearer about what it is doing.  It would be much shorter if all databases supported the "boolean as integer" shortcut:
select sum(a is not null)

(4) ,

What is a list of three things without a fourth to cap it off?  Just don't use a comma in the FROM clause.  Explicit join syntax is more expressive and clearer in every case.  The , can be replaced by CROSS JOIN.
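As a small sketch with placeholder tables t1 and t2: instead of

select * from t1, t2 where t1.id = t2.id;

write

select *
from t1 join
     t2
     on t1.id = t2.id;

and, when a true Cartesian product really is intended, write t1 cross join t2 so the intent is explicit.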

Thursday, March 20, 2014

Big Data and SQL

I happen to think that SQL is a very viable option for analyzing big data.  I was thinking about this when I read a book review recently:
For instance, Siegel reports, people who buy small felt pads that adhere to the bottom of chair legs (to protect the floor) are more likely than others to be good credit risks.
For some people, results like this conjure up magic:  PhDs in white coats bustling around, surrounded by acres of machines humming away, pondering this imponderable problem (or is that the air conditioning making the noise?).  In fact, something like this is readily calculated from a normal decision support database containing historical data.

So, how hard is it to write the SQL?

The place to start is to rephrase the question.  Let's ask it as:
For all products purchased by customers in 2013, what is the non-payment rate for the first three months of 2014?
Note that this is carefully phrased as a "before" and "after" problem.  Although that does not guarantee causality, it does help.

Next, assume that we have the following tables:

  • Customers
  • Orders
  • OrderProducts
  • Invoices (monthly, with a flag to indicate non-payment)
The following query gets all the products from 2013:

select op.ProductId, count(*) as NumProducts,
       count(distinct o.CustomerId) as NumCustomers
from Orders o join
     OrderProducts op
     on o.OrderId = op.OrderId
where o.OrderDate >= '2013-01-01' and
      o.OrderDate < '2014-01-01'
group by op.ProductId;


The following gets all customers who didn't pay in the first three months of 2014.  This might look something like:

select i.CustomerId
from Invoices i
where i.InvoiceDate >= '2014-01-01' and
      i.InvoiceDate < '2014-04-01' and
      i.NotPaid = 1;

These can then easily be combined to get a list of products, ordered by the proportion of customers who did not pay:

select pc.ProductId, count(*) as NumCustomers,
       count(np.CustomerId) as NumNotPaid,
       count(np.CustomerId)*1.0 / count(*) as NonPayRate
from (select op.ProductId, o.CustomerId
      from Orders o join
           OrderProducts op
           on o.OrderId = op.OrderId
      where o.OrderDate >= '2013-01-01' and
            o.OrderDate < '2014-01-01'
      group by op.ProductId, o.CustomerId
     ) pc left outer join
     (select i.CustomerId
      from Invoices i
      where i.InvoiceDate >= '2014-01-01' and
            i.InvoiceDate < '2014-04-01' and
            i.NotPaid = 1
     ) np
     on pc.CustomerId = np.CustomerId
group by pc.ProductId
order by NonPayRate desc;

This isn't particularly complex SQL.  Instead, we can think about what is really important.  The first is being willing to ask the question.  I think a major constraint in business is that managers and executives are hesitant to ask questions.  They don't have a sense of what is "easy" to answer and what is "hard".  They also fear getting different answers from different people.

The second is the interpretation.  The statement that people who want to protect their furniture are better credit risks has a nice warm and fuzzy quality:  people who care about their belongings also care about their credit.  Perhaps other factors are at work.  People buy new furniture and want to protect it because they have access to cash or credit -- they may simply be richer than other people at least for a period of time.  Or, felt pads may only be sold in areas where people tend to own their homes, so there is a store-bias in the merchandizing.  Or, customers who buy these small items may be paying in cash and never make larger purchases that might measure credit risk.

To understand what is really happening would require further analysis.  To get started just takes asking some insightful questions.






Tuesday, January 5, 2010

MapReduce versus Relational Databases?

The current issue of Communications of the ACM has articles on MapReduce and relational databases. One, "MapReduce: A Flexible Data Processing Tool," written by two Google fellows, explains the utility of MapReduce -- appropriate authors, since Google invented the parallel MapReduce paradigm.

The second article, "MapReduce and Parallel DBMSs: Friends or Foes?", is written by a team of authors, with Michael Stonebraker listed as the first author. I am uncomfortable with this article, because it purports to show the superiority of a particular database system, Vertica, without mentioning -- anywhere -- that Michael Stonebraker is listed as the CTO and Co-Founder on Vertica's web site. For this reason, I believe that this article should be subject to much more scrutiny.

Before starting, let me state that I personally have no major relationships with any of the database vendors or with companies in the Hadoop/MapReduce space. I am an advocate of using relational databases for data analysis and have written a book called Data Analysis Using SQL and Excel. And, over the past three months, I have been learning Hadoop and MapReduce, as attested to by numerous blog postings on the subject. Perhaps because I am a graduate of MIT ('85), I am upset that Michael Stonebraker uses his MIT affiliation for this article, without mentioning his Vertica affiliation.

The first thing I notice about the article is the number of references to Vertica. In the main text, I count nine references to Vertica, as compared to thirteen mentions of other databases:
  • Aster (twice)
  • DataAllegro (once)
  • DB2 (twice)
  • Greenplum (twice)
  • Netezza (once)
  • ParAccel (once)
  • PostgreSQL (once)
  • SQL Server (once)
  • Teradata (once)
The paper describes a study that compares Vertica, another database, and Hadoop on various tasks. The paper never explains how these databases were chosen for the comparison. Configuration issues for the other database and for Hadoop are mentioned. The configuration and installation of Vertica -- judging by the absence of reported problems -- one assumes was easy and smooth. I have not (yet) read the paper cited, which describes the work in more detail.

Also, the paper never describes costs for the different systems, which is a primary driver for MapReduce: the software is free and runs on cheap clusters of computers, rather than on expensive servers and hardware. For a given amount of money, MapReduce may provide a much faster solution, since it can support much larger hardware environments.

The paper never describes issues in the loading of data. I assume this is a significant cost for the databases. Loading the data for Hadoop is much simpler . . . since it just reads text files, which is a common format.

From what I can gather, the database systems were optimized specifically for the tasks at hand, although this is not explicitly mentioned anywhere. For instance, the second task is a GROUP BY, and I suspect that the data is hash partitioned by the GROUP BY clause.

There are a few statements that I basically disagree with.

"Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL." The issue here at first seems like a technicality. In a relational database, an input row can only into one group. MR can output multiple records in the map stage, so a single row can go into multiple "groups". This functionality is important for the word count example, which is the canonical MapReduce example. I find it interesting that this example is not included in the benchmark.

"Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL)." This is not true in several respects. First, MapReduce does have associated projects for supporting declarative languages. Second, in order for SQL to support the level of functionality that the authors claim, they need to use user defined functions. Is that syntax declarative?

More importantly, though, is that the computing model really is not exactly the same. Well, with SQL extensions such as GROUPING SETs and window functions, the functionality does come close. But, consider the ways that you can add a row number to data (assuming that you have no row number function built-in) using MapReduce versus traditional SQL. Using MapReduce you can follow the two-phase program that I described in an earlier posting. With traditional SQL, you have to do a non-equi-self join. MapReduce has a much richer set of built-in functions and capabilities, simply because it uses Java, an established programming language with many libraries.
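As a sketch of the traditional SQL approach (assuming a hypothetical table t with a unique column id), the non-equi self join looks something like:

select t.id, count(*) as rownum
from t join
     t t2
     on t2.id <= t.id
group by t.id;

Each row's number is the count of rows whose id is less than or equal to its own -- workable, but the join compares every pair of rows.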

On the other hand, MapReduce does not have a concept of "null" built-in (although users can define their own data types and semantics). And, MapReduce handles non-equijoins poorly, because the key is used to direct both tables to the same node. In effect, you have to limit the MapReduce job to one node. SQL can still parallelize such queries.

"[MapReduce] still requires user code to parse the value portion of the record if it contains multiple attributes." Well, parse is the wrong term, since a Writable class supports binary representations of data types. I describe how to create such types here.

I don't actually feel qualified to comment on many of the operational aspects of optimizing Hadoop code. I do note that the authors do not explain the main benefit of Vertica, which is the support of column partitioning. Each column is stored separately, which makes it possible to apply very strong compression algorithms to the data. In many cases, the Vertica data will fit in memory. This is a huge performance boost (and one that another vendor, ParAccel, takes advantage of).

In the end, the benchmark may be comparing the in-memory performance of a database to general performance for MapReduce. The benchmark may not be including the ETL time for loading the data, partitioning data, and building indexes. The benchmark may not have allocated optimal numbers of map and reduce jobs for the purpose. And, it is possible that the benchmark is unbiased and relational databases really are better.

A paper that leaves out the affiliations between its authors and the vendors used for a benchmark is only going to invite suspicion.

Sunday, October 19, 2008

Rolling and Unrolling Correlated Subqueries in SQL

The subject of correlated subqueries arose recently in a data mining class I was teaching. A student inquired about improving the performance of a particular query, which happened to have a correlated subquery. This posting discusses unrolling correlated subqueries to improve performance, as well as the rarer cases where the correlated subquery itself is the better-performing option.

Correlated subqueries are SQL queries that contain a nested subquery, where the nested query refers to one or more outside tables. The definition sounds complicated, but an example is worth a thousand words.

My book Data Analysis Using SQL and Excel includes a database of customers, orders, and transactions (which can be downloaded). From such data, we might ask a question such as "What products did customer X order on her or his earliest order date?" A typical way to answer this is with a correlated subquery.

SELECT ol.ProductID
FROM orders o JOIN
     orderline ol
     ON o.OrderID = ol.OrderID AND
        o.CustomerID = X
WHERE o.OrderDate = (SELECT MIN(OrderDate)
                     FROM orders o2
                     WHERE o2.CustomerID = o.CustomerID)


Since this is standard SQL, all reasonable relational databases should support this syntax. One syntax note: the subquery could optionally contain a "GROUP BY o2.CustomerID" clause.

What is the query doing? It is joining two tables together (orders and orderline) and then restricting the results to a single customer. However, the query is about the products in a particular order, so the WHERE clause selects the particular order -- as the one with the smallest OrderDate. Voila. The query answers the question.

The correlation is in the WHERE clause of the subquery, in the line o2.CustomerID = o.CustomerID. This places a restriction on the values in the subquery based on the results of an outer query. Do note that if that condition were instead o2.CustomerID = X, then the subquery would not be correlated, since there would be no connection to the outer tables.

So far so good. When we think of how the query runs, we think of iterating through every row in the o2 table and looking to match it to the current value in the o table. If there is an index, so much the better because the query engine can use the index to access the o2 table.

This conceptual approach is, in fact, how most (if not all) query engines optimize such a query. For now, I'm leaving open the question of whether this is a good thing, in order to present the idea of unrolling the subquery.

There are other ways to answer the original question ("What products did Customer X order on his or her earliest order date?"). The following query shows an alternative approach:

SELECT ol.ProductID
FROM orders o JOIN
     orderline ol
     ON o.OrderID = ol.OrderID JOIN
     (SELECT CustomerID, MIN(OrderDate) as minOrderDate
      FROM orders
      GROUP BY CustomerID) omin
     ON o.OrderDate = omin.minOrderDate AND
        o.CustomerID = omin.CustomerID
WHERE o.CustomerID = X

This version of the query unrolls the subquery, by creating a summary table with the earliest order date for all customers. The link to the other table is made through an explicit join condition between this summary table and the orders table.

Note that in this particular query, the WHERE clause that chooses the customer could be in the subquery, because the columns in the WHERE clause are in the subquery. However, in the general case, the filter could be using columns not available in the subquery -- such as getting all products that start with the letter "A".
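A sketch of such a case, assuming a hypothetical product table with a ProductName column (here the customer restriction is dropped and the filter is on the product name instead, which the order-date subquery cannot see):

SELECT ol.ProductID
FROM orders o JOIN
     orderline ol
     ON o.OrderID = ol.OrderID JOIN
     (SELECT CustomerID, MIN(OrderDate) as minOrderDate
      FROM orders
      GROUP BY CustomerID) omin
     ON o.OrderDate = omin.minOrderDate AND
        o.CustomerID = omin.CustomerID JOIN
     product p
     ON ol.ProductID = p.ProductID
WHERE p.ProductName LIKE 'A%'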

There is a big difference in how this query gets executed compared to the earlier version. The big difference is that now the orders need to be grouped to find the earliest order date for every customer. The correlated subquery, by contrast, could use an index and look at only the handful of rows for a given customer. So, the correlated subquery seems to be more efficient.

If the correlated subquery is more efficient, then why do I personally avoid correlated subqueries? One reason is the explicitness of the joins: I find the unrolled version much easier to understand. However, ease of understanding is less important than performance. In many cases, the unrolled version does execute faster.

Notice that both these queries are looking for data about one particular customer -- a small subset of the overall data. For queries that are looking for such needles in the haystack, then correlated subqueries are fine.

However, decision support queries are usually looking to sift through the whole haystack and not look for just the needle. If we changed the question to "What products are ordered on the earliest order date?" then the queries lose the restrictive clause limiting them to one customer. Now what happens?

In the case of the correlated subquery, query engines essentially execute the joins in one of two ways: (1) by repeatedly looping through one table (typically the one in the inner join) or (2) using indexes. In terms of join algorithms, these are nested loop joins and index-based joins -- two perfectly good join algorithms. But, I might add, two out of many algorithms that could be used.

On the other hand, the explicit join in the second example lets the query engine see all the steps it needs to perform and then decide on the best strategies. This matters for parallelism: when the data is partitioned for simultaneous access on multiple processors, most query engines would ignore the parallel possibilities for a correlated subquery and simply execute it on a single processor.

On the other hand, most parallel query engines would correctly parallelize the second version of the query. The GROUP BY would execute in parallel, as would the rest of the joins. The query optimizer would use table statistics to generate the best query plan.

Correlated subqueries are a tool used when designing queries. In all cases, though, the subqueries can be unrolled using more traditional aggregation and join operations. However, query optimizers generally do not perform this operation.

Correlated subqueries are often the most efficient approach when looking for a few rows from a table, particularly when the optimizer can use indexes for the join. On the other hand, unrolling the subqueries is often more efficient when there is a large amount of data, because the optimizer can do full query optimization, making use of parallelism and table statistics.

Currently, most query optimizers do not know how to unroll correlated subqueries -- or how to roll them back up. So, we need to make such decisions ourselves when writing the queries.

Tuesday, August 26, 2008

MapReduce Functionality in Commercial Databases

If you use LinkedIn, then you have probably been impressed by their "People you may know" feature. I know that I have. From old friends and colleagues to an occasional person I don't necessarily want to see again, the list often contains quite familiar names.

LinkedIn is basically a large graph of connections among people, enhanced with information such as company names, dates of links, and so on. We can imagine how they determine whether someone belongs in someone else's "People you may know" category: by using common names, common companies, and even common paths (people who know each other).

However, trying to imagine how they might determine this information using SQL is more challenging. SQL provides the ability to store a graph of connections. However, traversing the graph is rather complicated in standard SQL. Furthermore, much of the information that LinkedIn maintains is complicated data -- names of companies, job titles, and dates, for instance.

It is not surprising, then, that they are using MapReduce to develop this information. The surprise, though, is that their data is being stored in a relational database, which provides full transactional-integrity and SQL querying capabilities. The commercial database software that supports both is provided by a company called Greenplum.

Greenplum has distinguished itself from other next-generation database vendors by incorporating MapReduce into its database engine. Basically, Greenplum developed a parallel framework for managing data, ported Postgres into this framework, and has now ported MapReduce as well. This is a strong distinction from other database vendors that provide parallel Postgres solutions, and it is particularly well suited to the complex data types encountered on the web.

I do want to point out that the integration of MapReduce is at the programming level. In other words, they have not changed SQL; they have added a programming layer that allows data in the database to be readily accessed using MapReduce primitives.

As I've discussed in other posts, MapReduce and SQL are complementary technologies, each with their own strengths. MapReduce can definitely benefit from SQL functionality, since SQL has proven its ability for data storage and access. On the other hand, MapReduce has functionality that is not present in SQL databases.

Now that a database vendor has fully incorporated MapReduce into its database engine, we need to ask: Should MapReduce remain a programming paradigm or should it be incorporated into the SQL query language? What additional keywords and operators and so on are needed to enhance SQL functionality to include MapReduce?

Tuesday, July 29, 2008

Nested Subqueries in SQL

A recent question:

You used a lot of multi-layer subqueries. Equivalently, I think we can create intermediate tables or views and query them. It's easier for me to build, to follow, and to debug, especially from a dataflow diagram. But I do believe the two approaches will result in different time and space required. Could you elaborate on the difference?

[Note: This question is about the book Data Analysis Using SQL and Excel.]
This is a good question. Ironically, I received the email while I was sitting in an office in Kuala Lumpur, Malaysia, writing very complicated nested queries against a very remote database (the database is actually in San Diego).

In this case, the need for complicated nested subqueries was obvious: I (and the group I was working with) only have read access into the database. And for good reason. Although the database contains key analytic information, it is an operational database. In such cases, analysts often have only read access. (By the way, this brings up a question for Oracle: Why can't a user with read-only access explain a query?)

This very immediate experience provides the first answer to the question. In some circumstances, it is not possible or desirable to write to the database.

However, that is only the first reason. There are other reasons.

One of the jobs of a database is choosing the most efficient execution plan. For instance, a query might join several tables together. If we do these joins with intermediate tables, say, two at a time, then we impose an ordering on them. However, one of the most important parts of a SQL optimizer is the part that chooses the ordering of joins. Some join orders are much more efficient than others.

So, a second reason is that explicitly storing results in intermediate tables might prevent the SQL optimizer from choosing the most efficient query plan.
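As a small sketch (the table and column names are just illustrative, and "create table as select" would be spelled differently in some databases), compare materializing an intermediate result:

create table tmp_cust_orders as
    select o.CustomerId, count(*) as NumOrders
    from Orders o
    group by o.CustomerId;

select *
from tmp_cust_orders co join
     Invoices i
     on co.CustomerId = i.CustomerId;

with the single nested query, where the optimizer sees the whole computation at once and can, for example, reorder joins or push conditions into the aggregation:

select *
from (select o.CustomerId, count(*) as NumOrders
      from Orders o
      group by o.CustomerId
     ) co join
     Invoices i
     on co.CustomerId = i.CustomerId;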

The third reason also has to do with the database engine. Database engines manage storage. A complex query plan may produce several intermediate results in temporary tables. The writer of the query does not need to name these tables, keep track of the storage, or remember to delete them. The query engine does this automatically.

Doing analysis in a single query (versus in a script file) saves time and effort in storage management.

Storing data in intermediate tables may also impose constraints on those tables. In particular, the intermediate tables may not be parallelized, or the table space used for storage may be inefficient in some other way. Using subqueries eliminates any dependency on possibly inefficient temporary user storage.

Another reason has to do with columns in intermediate results that are not used. Eliminating columns in query processing can be significant for improving efficiency of queries. When storing results in an intermediate table, all the columns are stored. When using subqueries, the query plan should include only columns actually needed.

Subqueries allow the query plan to eliminate unused columns from intermediate tables.

A final reason for me is really a personal preference. The task of maintaining separate queries and tables, especially using naming conventions is cumbersome. If I store results in a temporary table, I want the name to mean something. If I'm using a subquery, then the name is less important. Or, if I change the name or type of a column, then I find it easier to make the change in one place rather than distributed through a script file. Using a subquery reduces my need to think of new names. Admittedly, this reason is really personal preference.

In general, writing complicated subqueries actually allows the query optimizer to do what it does best -- determine the most efficient execution plan for running the query. Although there are some exceptions, the general rule is quite simple: try to focus on the questions being answered and not how the results are combined. SQL may not be a perfect language for processing, but it does allow us to do very complex data manipulations with a minimum of programming concerns.

Tuesday, April 8, 2008

Databases, MapReduce, and Disks

I just came across an interesting blog posting by Tom White entitled "Disks Have Become Tapes". This is an interesting posting, but it makes the following claim: relational databases are limited by the seek speed of disks whereas MapReduce-based methods take advantage of the streaming capabilities of disks. Hence, MapReduce is better than RDBMS for various types of processing.

Once again, I read a comment in a blog that seems misguided and gives inaccurate information. My guess is that people learn relational databases from the update/insert perspective and don't understand complex query processing. Alas. I do recommend my book Data Analysis Using SQL and Excel for such folks. Relational databases can take advantage of high-throughput disks.

Of course, the problem is not new. Tom White quotes David DeWitt quoting Jim Gray saying "Disks are the new tapes" (here). And the numbers are impressive. It takes longer to read a high capacity disk now than it did twenty years ago, because capacity has increased much faster than transfer rates. As for random seeks on the disk, let's not go there. Seek times have hardly improved at all over this time period. Seeking on a disk is like going to Australia in a canoe -- the canoe works well enough to cross a river, so why not an ocean? And, as we all know, RDBMSs use a lot of seeks for queries so they cannot take advantage of modern disks. MapReduce to the rescue!

Wait, is that common wisdom really true?

It is true that for updating or fetching a single row, an RDBMS does use disk seeks to get there (especially if there is an index). However, this is much faster than the alternative of streaming through the whole table -- even on a fancy MapReduce system with many cheap processors connected to zillions of inexpensive disks.

On a complex query, the situation is a bit more favorable to the RDBMS for several reasons. First, large analytic queries typically read entire tables (or partitions of tables). They do not "take advantage" of indexing, since they read all rows using full table scans.

However, database engines do not read rows. They read pages. Between the query processor and the data is the page manager. Or, as T. S. Eliot wrote in his poem "The Hollow Men" [on an entirely different topic]:

Between the idea
And the reality
Between the motion
And the act
Falls the shadow

In this case, the shadow is the page manager, a very important but often overlooked component of a database management system.

Table scans read the pages assigned to a table. So, query performance is based on a balance of disk performance (both throughput and latency) and page size. For a database used for analytics, use a big page size. 4k is way too small . . . 128k or even 1Mbyte could be very reasonable (and I have seen systems with even larger page sizes). Also, remember to stuff the pages full. There is no reason to partially fill pages unless the table has updates (which are superfluous for most data warehouse tables).

Databases do a lot of things to improve performance. Probably the most important boost is accidental. Large database tables are typically loaded in bulk, say once-per-day. As a result, the pages are quite likely to be allocated sequentially. Voila! In such cases, the seek time from one page to the next is minimal.

But, databases are smarter than that. The second boost is pre-fetching pages that are likely to be needed. Even a not-so-smart database engine can realize when it is doing a full table scan. The page manager can seek to the next page at the same time that the processor is processing data in memory. That is, the CPU is working, while the page manager spends its time waiting for new pages to load. Although the page manager is waiting, the CPU is quite busy processing other data, so there is no effective wait time.

This overlap between CPU cycles and disk is very important for database performance on large queries. And you can see it on a database machine. In a well-balanced system, the CPUs are often quite busy on a large query and the disks are less busy.

Modern RDBMSs have a third capability with respect to complex queries. Much of the work is likely to take place in temporary tables. The page manager would often store these on sequential pages, and they would be optimized for sequential access. In addition, temporary tables only store the columns that they need.

In short, databases optimize their disk access in several ways. They take advantage of high-throughput disks by:
  • using large page sizes to reduce the impact of latency;
  • storing large databases on sequential pages;
  • prefetching pages while the processor works on data already in memory;
  • efficiently storing temporary tables.
At least they are doing something! By the way, the balance between latency and throughput goes back at least to the 1980s when I entered this business. And I suspect that it is a much older concern.

The advantage and disadvantage of the MapReduce approach is that it leaves such optimizations in the hands of the operating system and the programmer. Fortunately, modern computer languages are smart with respect to sequential file I/O, so reading some records and then processing them would normally be optimized.

Of course, a programmer can disrupt this by writing temporary or output files to the same disk system being used to read data. Well, actually, disks are also getting smarter. With multiple platters and multiple read heads, modern disks can support multiple seeks to different areas.

A bigger problem arises with complex algorithms. MapReduce does not provide built-in support for joining large tables. Nor even for joining smaller tables. A nested loop join in MapReduce code could kill the performance of a query. An RDBMS might implement the same join using hash tables that gracefully overflow memory, should that be necessary. An exciting development in a programmer's life is when a hash table in memory gets too big and he or she learns about operating system page faults, a concern that the database engine takes care of by itself.

As I've mentioned before, RDBMS versus MapReduce is almost a religious battle. MapReduce has capabilities that RDBMSs do not have, and not only because programming languages are more expressive than SQL. The paradigm is strong and capable for certain tasks.

On the other hand, SQL is a comparatively easy language to learn (I mean compared to programming for MapReduce), and relational database engines often have decades of experience built into them, for partitioning data, choosing join and aggregation algorithms, building temporary tables, keeping processors busy and disks spinning, and so on. In particular, RDBMSs do know a trick or two to optimize disk performance and take advantage of modern high-throughput (if still high-latency) disks.

--gordon

Friday, January 25, 2008

MapReduce and SQL Aggregations

This is another post discussing the article on MapReduce written by Professors Michael Stonebraker and David DeWitt (available here).

One of the claims that they make is:

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

This claim is worth discussing in more detail because it is a very powerful and intuitive analogy. And, it is simply incorrect. MapReduce is much more powerful than SQL aggregations.

Another reason why I find their claim interesting is because I use the same analogy to describe MapReduce to people familiar with databases. Let me explain this in a bit more detail. Consider the following SQL query to count the number of customers who start in each month:

SELECT MONTH(c.start_date), COUNT(*)
FROM customer c
GROUP BY MONTH(c.start_date)

Now, let's consider how MapReduce would "implement" this query. Of course, MapReduce is a programming framework, not a query interpreter, but we can still use it to solve the same problem. The MapReduce framework solves this problem in the following way:

First, the map phase would read records from the customer table and produce an output record with two parts. The first part is called the "key", which is populated with MONTH(c.start_date). The second is the "value", which can be arbitrarily complicated. In this case, it is as simple as it gets. The value part simply contains "1".

The reduce phase then reads the key-value pairs, and aggregates them. The MapReduce framework sorts the data so records with the same key always occur together. This makes it easy for the reduce phase to add together all the "1"s to get the count for each key (which is the month number). The result is a count for each key.

I have intentionally oversimplified this process by describing it at a high level. The first simplification is leaving out all the C++ or Java overhead for producing the programs (although there are attempts at interpreted languages to greatly simplify this process). Another is not describing the parallel processing aspects. And yet another oversimplification is leaving out the "combine" step. The above algorithm can be made more efficient by first "reducing" the values locally on each processor to get subtotals, and then "reducing" these again. This post, however, is not about computational efficiency.

The important thing to note is the following three correspondences between MapReduce and SQL aggregates.
  1. First, MapReduce uses a "key". This key is the GROUP BY expression in a SQL aggregation statement.
  2. Second, MapReduce has a "map" function. This is the expression inside the parentheses. This can be an arbitrary function or CASE statement in SQL. In databases that support user defined functions, this can be arbitrarily complicated, as with the "map" function in MapReduce.
  3. Third, MapReduce has a "reduce" function. This is the aggregation function. In SQL, this is traditionally one of a handful of functions (SUM(), AVG(), MIN(), MAX()). However, some databases support user-defined aggregation functions, which can be arbitrarily complicated.
So, it seems that SQL and MapReduce are equivalent, particularly in an environment where SQL supports user defined functions written in an external language (such as C++, Java, or C-Sharp).

Wrong!

The next example extends the previous one by asking how many customers start and how many stop in each month. There are several ways of approaching this. The following shows one approach using a FULL OUTER JOIN:

SELECT COALESCE(start.m, stop.m) as m, ISNULL(numstarts, 0), ISNULL(numstops, 0)
FROM (SELECT MONTH(start_date) as m, COUNT(*) as numstarts
      FROM customer c
      GROUP BY MONTH(start_date)
     ) start FULL OUTER JOIN
     (SELECT MONTH(stop_date) as m, COUNT(*) as numstops
      FROM customer c
      GROUP BY MONTH(stop_date)
     ) stop
     ON start.m = stop.m

Another formulation might use an aggregation and UNION:

SELECT m, SUM(isstart), SUM(isstop)
FROM ((SELECT MONTH(start_date) as m, 1 as isstart, 0 as isstop
       FROM customer c)
      UNION ALL
      (SELECT MONTH(stop_date) as m, 0 as isstart, 1 as isstop
       FROM customer c)) a
GROUP BY m

Now, in both of these, there are two pieces of processing, one for the starts and one for the stops. In almost any databases optimizer that I can think of, both these queries (and other, similar queries) require two passes through the data, one pass for the starts and one pass for the stops. And, regardless of the optimizer, the SQL statements describe two passes through the data.

The MapReduce framework has a more efficient, and perhaps, even more intuitive solution. The map phase can produce two output keys for each record:
  • The first has a key that is MONTH(start_date) and the value is a structure containing isstart and isstop with values of 1 and 0 respectively.
  • The second has a key that is MONTH(stop_date), and the value has isstart and isstop of 0 and 1 respectively.
What is important about this example is not the details, simply the fact that the processing is quite different. The SQL methods describe two passes through the data. The MapReduce method has only one pass through the data. In short, MapReduce can be more efficient than SQL aggregations.

How much better becomes apparent when we look in more detail at what is happening. When processing data, SQL limits us to one key per record for aggregation. MapReduce does not have this limitation. We can have as many keys as we like for each record.

This difference is particularly important when analyzing complex data structures to extract features -- of which processing text from web pages is an obvious example. To take another example, one way of doing content filtering of email for spam involves looking for suspicious words in the text and then building a scoring function based on those words (naive Bayesian models would be a typical approach).

Attempting to do this in MapReduce is quite simple. The map phase looks for each word and spits out a key value pair for that word (in this case, the key is actually the email id combined with the word). The reduce phase counts them all up. Either the reduce phase or a separate program can then apply the particular scoring code. Extending such a program to include more suspicious words is pretty simple.

Attempting to do this in SQL . . . well, I wouldn't do it in the base language. It would be an obfuscated mess of CASE statements or a non-equi JOIN on a separate word table. The MapReduce approach is simpler and more elegant in this case. Ironically, I usually describe the problem as "SQL is not good in handling text." However, I think the real problem is that "SQL aggregations are limited to one key per record."

SQL has many capabilities that MapReduce does not have. Over time, MapReduce may incorporate many of the positive features of SQL for analytic capabilities (and hopefully not incorporate unneeded overhead for transactions and the like). Today, SQL remains superior for most analytic needs -- and it can take advantage of parallelism without programming. I hope SQL will also evolve, bringing together the enhanced functionality from other approaches.

--gordon