Tuesday, January 5, 2010

MapReduce versus Relational Databases?

The current issue of Communications of the ACM has articles on MapReduce and relational databases. One, MapReduce: A Flexible Data Processing Tool, written by two Google fellows, explains the utility of MapReduce -- appropriate authors, since Google invented the parallel MapReduce paradigm.

The second article, MapReduce and Parallel DBMSs: Friend or Foe, is written by a team of authors, with Michael Stonebraker listed as the first author. I am uncomfortable with this article because it purports to show the superiority of a particular database system, Vertica, without mentioning -- anywhere -- that Michael Stonebraker is listed as CTO and Co-Founder on Vertica's web site. For this reason, I believe the article should be subject to much more scrutiny.

Before starting, let me state that I personally have no major relationships with any of the database vendors or with companies in the Hadoop/MapReduce space. I am an advocate of using relational databases for data analysis and have written a book called Data Analysis Using SQL and Excel. And, over the past three months, I have been learning Hadoop and MapReduce, as attested to by numerous blog postings on the subject. Perhaps because I am a graduate of MIT ('85), I am upset that Michael Stonebraker uses his MIT affiliation for this article, without mentioning his Vertica affiliation.

The first thing I notice about the article is the number of references to Vertica. In the main text, I count nine references to Vertica, as compared to thirteen mentions of other databases:
  • Aster (twice)
  • DataAllegro (once)
  • DB2 (twice)
  • Greenplum (twice)
  • Netezza (once)
  • ParAccel (once)
  • PostgreSQL (once)
  • SQL Server (once)
  • Teradata (once)
The paper describes a study that compares Vertica, another database, and Hadoop on various tasks. The paper never explains how these systems were chosen. Configuration issues for the other database and for Hadoop are mentioned; judging by the absence of any such problems, one assumes the configuration and installation of Vertica were easy and smooth. I have not (yet) read the cited paper, which describes the work in more detail.

Also, the paper never describes the costs of the different systems, which is a primary driver of MapReduce adoption. The software is free and runs on clusters of cheap computers, rather than on expensive servers. For a given amount of money, MapReduce may provide a much faster solution, since it can support a much larger hardware environment.

The paper never describes issues in loading the data. I assume this is a significant cost for the databases. Loading data into Hadoop is much simpler . . . since it just reads text files, which are a common format.

From what I can gather, the database systems were optimized specifically for the tasks at hand, although this is not explicitly mentioned anywhere. For instance, the second task is a GROUP BY, and I suspect that the data is hash partitioned on the GROUP BY key.

There are a few statements that I basically disagree with.

"Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL." The issue here at first seems like a technicality. In a relational database, an input row can only into one group. MR can output multiple records in the map stage, so a single row can go into multiple "groups". This functionality is important for the word count example, which is the canonical MapReduce example. I find it interesting that this example is not included in the benchmark.

"Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL)." This is not true in several respects. First, MapReduce does have associated projects for supporting declarative languages. Second, in order for SQL to support the level of functionality that the authors claim, they need to use user defined functions. Is that syntax declarative?

More important, though, the computing model is really not exactly the same. With SQL extensions such as GROUPING SETS and window functions, the functionality does come close. But consider the ways that you can add a row number to data (assuming no built-in row number function) using MapReduce versus traditional SQL. Using MapReduce, you can follow the two-phase program that I described in an earlier posting. With traditional SQL, you have to do a non-equi self-join. MapReduce has a much richer set of built-in functions and capabilities, simply because it uses Java, an established programming language with many libraries.

On the other hand, MapReduce does not have a concept of "null" built-in (although users can define their own data types and semantics). And, MapReduce handles non-equijoins poorly, because the key is used to direct both tables to the same node. In effect, you have to limit the MapReduce job to one node. SQL can still parallelize such queries.

"[MapReduce] still requires user code to parse the value portion of the record if it contains multiple attributes." Well, parse is the wrong term, since a Writable class supports binary representations of data types. I describe how to create such types here.

I don't actually feel qualified to comment on many of the operational aspects of optimizing Hadoop code. I do note that the authors do not explain the main benefit of Vertica, which is its columnar storage. Each column is stored separately, which makes it possible to apply very strong compression algorithms to the data. In many cases, the Vertica data will fit in memory. This is a huge performance boost (and one that another vendor, ParAccel, takes advantage of).

In the end, the benchmark may be comparing the in-memory performance of a database to general performance for MapReduce. The benchmark may not include the ETL time for loading the data, partitioning it, and building indexes. It may not have allocated optimal numbers of map and reduce tasks for the purpose. And, it is possible that the benchmark is unbiased and relational databases really are better.

A paper that leaves out the affiliations between its authors and the vendors used for a benchmark is only going to invite suspicion.


Sunday, October 19, 2008

Rolling and Unrolling Correlated Subqueries in SQL

The subject of correlated subqueries arose recently in a data mining class I was teaching. A student inquired about improving the performance of a particular query, which happened to have a correlated subquery. This posting discusses unrolling correlated subqueries to improve performance, as well as the rarer cases where the correlated subquery is itself the faster choice.

Correlated subqueries are SQL queries that contain a nested subquery, where the nested query refers to one or more outside tables. The definition sounds complicated, but an example is worth a thousand words.

My book Data Analysis Using SQL and Excel includes a database of customers, orders, and transactions (which can be downloaded). From such data, we might ask a question such as "What products did customer X order on her or his earliest order date?" A typical way to answer this is with a correlated subquery.

SELECT ol.ProductID
FROM orders o JOIN
     orderline ol
     ON o.OrderID = ol.OrderID AND
        o.CustomerID = X
WHERE o.OrderDate = (SELECT MIN(OrderDate)
                     FROM orders o2
                     WHERE o2.CustomerID = o.CustomerID)


Since this is standard SQL, all reasonable relational databases should support this syntax. One syntax note: the subquery could optionally contain a "GROUP BY o2.CustomerID" clause.

What is the query doing? It is joining two tables together (orders and orderline) and then restricting the results to a single customer. However, the query is about the products in a particular order, so the WHERE clause selects the particular order -- as the one with the smallest OrderDate. Voila. The query answers the question.

The correlation is in the WHERE clause, buried in the subquery in the line o2.CustomerID = o.CustomerID. This places a restriction on the values in the subquery based on the results of the outer query. Do note that if the subquery's WHERE clause compared to a constant instead, such as o2.CustomerID = X, then the subquery would not be correlated, since there would be no connection to the outer tables.

So far so good. When we think of how the query runs, we think of iterating through every row in the o2 table and looking to match it to the current value in the o table. If there is an index, so much the better because the query engine can use the index to access the o2 table.

This conceptual approach is, in fact, how most (if not all) query engines optimize such a query. For now, I'm leaving open the question of whether this is a good thing, in order to present the idea of unrolling the subquery.

There are other ways to answer the original question ("What products did Customer X order on his or her earliest order date?"). The following query shows an alternative approach:

SELECT ProductID
FROM orders o JOIN
     orderline ol
     ON o.OrderID = ol.OrderID JOIN
     (SELECT CustomerID, MIN(OrderDate) as minOrderDate
      FROM orders
      GROUP BY CustomerID) omin
     ON o.OrderDate = omin.minOrderDate AND
        o.CustomerID = omin.CustomerID
WHERE o.CustomerID = X

This version of the query unrolls the subquery, by creating a summary table with the earliest order date for all customers. The link to the other table is made through an explicit join condition between this summary table and the orders table.

Note that in this particular query, the WHERE clause that chooses the customer could be in the subquery, because the columns in the WHERE clause are in the subquery. However, in the general case, the filter could be using columns not available in the subquery -- such as getting all products that start with the letter "A".

There is a big difference in how this query gets executed versus the earlier version: now the orders need to be grouped to find the earliest order date for every customer. The correlated subquery, by contrast, could use an index and look at only the handful of rows for the given customer. So, the correlated subquery seems to be more efficient.

If the correlated subquery is more efficient, then why do I personally avoid correlated subqueries? One reason is the explicitness of the joins: I find the unrolled version much easier to understand. However, ease of understanding is less important than performance -- and in many cases, the unrolled version does execute faster.

Notice that both these queries are looking for data about one particular customer -- a small subset of the overall data. For queries that look for such needles in the haystack, correlated subqueries are fine.

However, decision support queries usually sift through the whole haystack rather than look for just the needle. If we change the question to "What products does each customer order on his or her earliest order date?" then the queries lose the restrictive clause limiting them to one customer. Now what happens?

In the case of the correlated subquery, query engines essentially execute the joins in one of two ways: (1) by repeatedly looping through one table (typically the one in the subquery) or (2) by using indexes. In terms of join algorithms, these are nested loop joins and index-based joins -- two perfectly good join algorithms. But, I might add, two out of many algorithms that could be used.

On the other hand, writing the explicit join as in the second example lets the query engine see the individual steps it needs to execute and then decide on the best strategies for them. This matters most when the data is partitioned for simultaneous access on multiple processors: with the correlated form, most query engines forget the parallel possibilities and simply execute the subquery on a single processor.

By contrast, most parallel query engines would correctly parallelize the second version of the query. The GROUP BY would execute in parallel, as would the rest of the joins, and the query optimizer would use table statistics to generate the best query plan.

Correlated subqueries are a tool used when designing queries. In all cases, though, the subqueries can be unrolled using more traditional aggregation and join operations. However, query optimizers generally do not perform this operation.

Correlated subqueries are often the most efficient approach when looking for a few rows from a table, particularly when the optimizer can use indexes for the join. On the other hand, unrolling the subqueries is often more efficient when there is a large amount of data, because the optimizer can do full query optimization, making use of parallelism and table statistics.

Currently, most query optimizers do not know how to unroll correlated subqueries -- or how to roll them back up. So, we need to make such decisions when writing the queries ourselves.


Tuesday, August 26, 2008

MapReduce Functionality in Commercial Databases

If you use LinkedIn, then you have probably been impressed by their "People you may know" feature. I know that I have. From old friends and colleagues to an occasional person I don't necessarily want to see again, the list often contains quite familiar names.

LinkedIn is basically a large graph of connections among people, enhanced with information such as company names, dates of links, and so on. We can imagine how they determine whether someone might belong in someone else's "People you may know" category: by using common names, common companies, and even common paths (people who know each other).

Trying to imagine how they might determine this information using SQL is more challenging. SQL provides the ability to store a graph of connections, but traversing the graph is rather complicated in standard SQL. Furthermore, much of the information that LinkedIn maintains is complicated data -- names of companies, job titles, and dates, for instance.

It is not surprising, then, that they are using MapReduce to develop this information. The surprise, though, is that their data is stored in a relational database, which provides full transactional integrity and SQL querying capabilities. The commercial database software that supports both is provided by a company called Greenplum.

Greenplum has distinguished itself from other next-generation database vendors by incorporating MapReduce into its database engine. Basically, Greenplum developed a parallel framework for managing data, ported Postgres into this framework, and has now ported MapReduce as well. This is a strong distinction from other database vendors that provide parallel Postgres solutions, and it is particularly well suited to the complex data types encountered on the web.

I do want to point out that the integration of MapReduce is at the programming level. In other words, they have not changed SQL; they have added a programming layer that allows data in the database to be readily accessed using MapReduce primitives.

As I've discussed in other posts, MapReduce and SQL are complementary technologies, each with their own strengths. MapReduce can definitely benefit from SQL functionality, since SQL has proven its ability for data storage and access. On the other hand, MapReduce has functionality that is not present in SQL databases.

Now that a database vendor has fully incorporated MapReduce into its database engine, we need to ask: Should MapReduce remain a programming paradigm or should it be incorporated into the SQL query language? What additional keywords and operators and so on are needed to enhance SQL functionality to include MapReduce?


Tuesday, July 29, 2008

Nested Subqueries in SQL

A recent question:

You used a lot of multi-layer subqueries. Equivalently, I think we can create intermediate tables or views and query them. It's easier for me to build, to follow, and to debug, especially from a dataflow diagram. But I do believe the two approaches will result in different time and space requirements. Could you elaborate on the difference?

[Note: This question is about the book Data Analysis Using SQL and Excel.]

This is a good question. Ironically, I received the email while I was sitting in an office in Kuala Lumpur, Malaysia, writing very complicated nested queries against a very remote database (the database is actually in San Diego).

In this case, the need for complicated nested subqueries was obvious: I (and the group I was working with) only have read access into the database. And for good reason. Although the database contains key analytic information, it is an operational database. In such cases, analysts often have only read access. (By the way, this brings up a question for Oracle: Why can't a user with read-only access explain a query?)

This very immediate experience provides the first answer to the question. In some circumstances, it is not possible or desirable to write to the database.

However, that is only the first reason. There are other reasons.

One of the jobs of a database engine is finding the most efficient execution plan. For instance, a query might join several tables together. If we do these joins with intermediate tables, say, two at a time, then we impose an ordering on them. However, one of the most important parts of a SQL optimizer is the part that chooses the ordering of joins. Some ways of doing the joins are more efficient than others.

So, a second reason is that explicitly storing results in intermediate tables might prevent the SQL optimizer from choosing the most efficient query plan.

The third reason also has to do with the database engine. Database engines manage storage. A complex query plan may produce several intermediate results in temporary tables. The writer of the query does not need to name these tables, keep track of the storage, or remember to delete them. The query engine does this automatically.

Doing analysis in a single query (versus in a script file) saves time and effort in storage management.

Storing data in intermediate tables may also impose constraints on those tables. In particular, the intermediate tables may not be parallelized, or the table space used for storage may be inefficient in some other way. Using subqueries eliminates any dependency on possibly inefficient temporary user storage.

Another reason has to do with columns in intermediate results that are not used. Eliminating unused columns during query processing can significantly improve efficiency. When results are stored in an intermediate table, all the columns are stored. When using subqueries, the query plan should carry only the columns actually needed.

Subqueries allow the query plan to eliminate unused columns from intermediate tables.

A final reason is really a personal preference. The task of maintaining separate queries and tables, especially the naming conventions involved, is cumbersome. If I store results in a temporary table, I want the name to mean something. If I'm using a subquery, then the name is less important. And if I change the name or type of a column, I find it easier to make the change in one place rather than distributed through a script file. Using a subquery reduces my need to think of new names.

In general, writing complicated subqueries actually allows the query optimizer to do what it does best -- determine the most efficient execution plan for running the query. Although there are some exceptions, the general rule is quite simple: try to focus on the questions being answered and not on how the results are combined. SQL may not be a perfect language for processing, but it does allow us to do very complex data manipulations with a minimum of programming concerns.


Tuesday, April 8, 2008

Databases, MapReduce, and Disks

I just came across a blog posting by Tom White entitled "Disks Have Become Tapes". It is an interesting posting, but it makes the following claim: relational databases are limited by the seek speed of disks, whereas MapReduce-based methods take advantage of the streaming capabilities of disks. Hence, MapReduce is better than an RDBMS for various types of processing.

Once again, I read a comment in a blog that seems misguided and gives inaccurate information. My guess is that people learn relational databases from the update/insert perspective and don't understand complex query processing. Alas. I do recommend my book Data Analysis Using SQL and Excel for such folks. Relational databases can take advantage of high-throughput disks.

Of course, the problem is not new. Tom White quotes David DeWitt quoting Jim Gray saying "Disks are the new tapes" (here). And the numbers are impressive. It takes longer to read an entire high-capacity disk now than it did twenty years ago, because capacity has increased much faster than transfer rates. As for random seeks on the disk, let's not go there. Seek times have hardly improved at all over this time period. Seeking on a disk is like going to Australia in a canoe -- the canoe works well enough to cross a river, so why not an ocean? And, as we all know, RDBMSs use a lot of seeks for queries, so they cannot take advantage of modern disks. MapReduce to the rescue!

Wait, is that common wisdom really true?

It is true that for updating or fetching a single row, an RDBMS does use disk seeks to get there (especially if there is an index). However, this is much faster than the alternative of streaming through the whole table -- even on a fancy MapReduce system with many cheap processors connected to zillions of inexpensive disks.

On a complex query, the situation is a bit more favorable to the RDBMS for several reasons. First, large analytic queries typically read entire tables (or partitions of tables). They do not "take advantage" of indexing, since they read all rows using full table scans.

However, database engines do not read rows. They read pages. Between the query processor and the data is the page manager. Or, as T. S. Eliot wrote in his poem "The Hollow Men" [on an entirely different topic]:

Between the idea
And the reality
Between the motion
And the act
Falls the shadow

In this case, the shadow is the page manager, a very important but often overlooked component of a database management system.

Table scans read the pages assigned to a table. So, query performance depends on a balance of disk performance (both throughput and latency) and page size. For a database used for analytics, use a big page size. 4k is way too small . . . 128k or even 1 Mbyte can be very reasonable (and I have seen systems with even larger page sizes). Also, remember to stuff the pages full. There is no reason to partially fill pages unless the table gets updates (which are superfluous for most data warehouse tables).

Databases do a lot of things to improve performance. Probably the most important boost is accidental. Large database tables are typically loaded in bulk, say once per day. As a result, the pages are quite likely to be allocated sequentially. Voila! In such cases, the seek time from one page to the next is minimal.

But databases are smarter than that. The second boost is prefetching pages that are likely to be needed. Even a not-so-smart database engine can recognize when it is doing a full table scan. The page manager can fetch the next page at the same time that the processor is working on data already in memory: while the page manager waits on the disk, the CPU stays busy processing other data, so there is little effective wait time.

This overlap between CPU cycles and disk is very important for database performance on large queries. And you can see it on a database machine. In a well-balanced system, the CPUs are often quite busy on a large query and the disks are less busy.
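The same overlap is easy to sketch outside a database. The toy Java program below (my own illustration, not database code) uses a reader thread to fetch the next chunk of a file while the main thread processes the previous one, which is essentially what a page manager's prefetching accomplishes.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchDemo {
    private static final int CHUNK_SIZE = 1 << 20;      // 1 MB "pages"
    private static final byte[] POISON = new byte[0];   // end-of-file marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4);

        // Reader thread: keeps fetching the next chunk while the main thread works.
        Thread reader = new Thread(() -> {
            try (FileInputStream in = new FileInputStream(args[0])) {
                while (true) {
                    byte[] buf = new byte[CHUNK_SIZE];
                    int n = in.read(buf);
                    if (n < 0) break;
                    byte[] chunk = new byte[n];
                    System.arraycopy(buf, 0, chunk, 0, n);
                    queue.put(chunk);
                }
                queue.put(POISON);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        long total = 0;
        while (true) {
            byte[] chunk = queue.take();                 // next chunk is usually already waiting
            if (chunk == POISON) break;
            for (byte b : chunk) total += b;             // stand-in for real query processing
        }
        reader.join();
        System.out.println("checksum: " + total);
    }
}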

Modern RDBMSs have a third capability with respect to complex queries. Much of the work is likely to take place in temporary tables. The page manager would often store these on sequential pages, optimized for sequential access. In addition, temporary tables store only the columns that they need.

In short, databases optimize their disk access in several ways. They take advantage of high-throughput disks by:
  • using large page sizes to reduce the impact of latency;
  • storing large databases on sequential pages;
  • prefetching pages while the processor works on data already in memory;
  • efficiently storing temporary tables.
At least they are doing something! By the way, the balance between latency and throughput goes back at least to the 1980s when I entered this business. And I suspect that it is a much older concern.

The advantage and disadvantage of the MapReduce approach is that it leaves such optimizations in the hands of the operating system and the programmer. Fortunately, modern computer languages are smart with respect to sequential file I/O, so reading some records and then processing them would normally be optimized.

Of course, a programmer can disrupt this by writing temporary or output files to the same disk system being used to read data. Then again, disk subsystems are also getting smarter: with command queuing and multiple drives, they can service interleaved requests to different areas reasonably well.

A bigger problem arises with complex algorithms. MapReduce does not provide built-in support for joining large tables. Nor even for joining smaller tables. A nested loop join in MapReduce code could kill the performance of a query. An RDBMS might implement the same join using hash tables that gracefully overflow memory, should that be necessary. An exciting development in a programmer's life is when a hash table in memory gets too big and he or she learns about operating system page faults, a concern that the database engine takes care of by itself.
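For illustration, here is a minimal sketch of the kind of hand-rolled, in-memory hash join a MapReduce programmer might write when one side is small; the names and record layout are my own assumptions. Note that nothing here spills gracefully to disk: if the hash table outgrows memory, that is the programmer's problem, whereas a database engine handles the overflow itself.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {
    record Customer(int customerId, String name) { }
    record Order(int orderId, int customerId, double amount) { }

    // Join orders to customers on customerId by building a hash table on the smaller side.
    static List<String> join(List<Customer> customers, List<Order> orders) {
        Map<Integer, Customer> byId = new HashMap<>();
        for (Customer c : customers) {
            byId.put(c.customerId(), c);            // build phase: small table held in memory
        }
        List<String> results = new ArrayList<>();
        for (Order o : orders) {                    // probe phase: stream the large table
            Customer c = byId.get(o.customerId());
            if (c != null) {
                results.add(c.name() + " ordered " + o.amount());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(new Customer(1, "Alice"), new Customer(2, "Bob"));
        List<Order> orders = List.of(new Order(10, 1, 25.0), new Order(11, 2, 40.0));
        join(customers, orders).forEach(System.out::println);
    }
}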

As I've mentioned before, RDBMS versus MapReduce is almost a religious battle. MapReduce has capabilities that RDBMSs do not have, and not only because programming languages are more expressive than SQL. The paradigm is strong and capable for certain tasks.

On the other hand, SQL is a comparatively easy language to learn (I mean compared to programming for MapReduce), and relational database engines often have decades of experience built into them -- for partitioning data, choosing join and aggregation algorithms, building temporary tables, keeping processors busy and disks spinning, and so on. In particular, RDBMSs do know a trick or two to optimize disk performance and take advantage of modern disks, whose throughput has improved far faster than their latency.

--gordon


Friday, January 25, 2008

MapReduce and SQL Aggregations

This is another post discussing the article on MapReduce written by Professors Michael Stonebraker and David DeWitt (available here).

One of the claims that they make is:

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

This claim is worth discussing in more detail because it is a very powerful and intuitive analogy. And, it is simply incorrect. MapReduce is much more powerful than SQL aggregations.

Another reason I find their claim interesting is that I use the same analogy to describe MapReduce to people familiar with databases. Let me explain this in a bit more detail. Consider the following SQL query to count the number of customers who start in each month:

SELECT MONTH(c.start_date), COUNT(*)
FROM customer c
GROUP BY MONTH(c.start_date)

Now, let's consider how MapReduce would "implement" this query. Of course, MapReduce is a programming framework, not a query interpreter, but we can still use it to solve the same problem. The MapReduce framework solves this problem in the following way:

First, the map phase would read records from the customer table and produce an output record with two parts. The first part is called the "key", which is populated with MONTH(c.start_date). The second is the "value", which can be arbitrarily complicated. In this case, it is as simple as it gets. The value part simply contains "1".

The reduce phase then reads the key-value pairs, and aggregates them. The MapReduce framework sorts the data so records with the same key always occur together. This makes it easy for the reduce phase to add together all the "1"s to get the count for each key (which is the month number). The result is a count for each key.
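In Hadoop's Java API, a sketch of this job might look as follows. Purely for illustration, I assume the customer table has been exported as comma-separated text with the start date in the second column in yyyy-MM-dd form; the class names are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StartsByMonth {

    // Map: emit (month of start_date, 1) for each customer record.
    public static class StartMonthMapper
            extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable month = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String startDate = fields[1];                    // assumed yyyy-MM-dd
            month.set(Integer.parseInt(startDate.substring(5, 7)));
            context.write(month, ONE);
        }
    }

    // Reduce: add up the 1s for each month, just like COUNT(*) per group.
    public static class SumReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable month, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(month, new IntWritable(total));
        }
    }
}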

I have intentionally oversimplified this process by describing it at a high level. The first simplification is leaving out all the C++ or Java overhead for producing the programs (although there are attempts at interpreted languages to greatly simplify this process). Another is not describing the parallel processing aspects. And yet another oversimplification is leaving out the "combine" step. The above algorithm can be made more efficient by first "reducing" the values locally on each processor to get subtotals, and then "reducing" these subtotals again. This post, however, is not about computational efficiency.

The important thing to note is the following three correspondences between MapReduce and SQL aggregates.
  1. First, MapReduce uses a "key". This key is the GROUP BY expression in a SQL aggregation statement.
  2. Second, MapReduce has a "map" function. In the SQL analogy, this corresponds to the expressions that compute the group key and the values being aggregated -- here, MONTH(c.start_date) and the constant 1. These can be arbitrary functions or CASE statements in SQL. In databases that support user-defined functions, they can be arbitrarily complicated, as with the "map" function in MapReduce.
  3. Third, MapReduce has a "reduce" function. This is the aggregation function. In SQL, this is traditionally one of a handful of functions (SUM(), AVG(), MIN(), MAX()). However, some databases support user-defined aggregation functions, which can be arbitrarily complicated.
So, it seems that SQL and MapReduce are equivalent, particularly in an environment where SQL supports user-defined functions written in an external language (such as C++, Java, or C#).

Wrong!

The next example extends the previous one by asking how many customers start and how many stop in each month. There are several ways of approaching this. The following shows one approach using a FULL OUTER JOIN:

SELECT ISNULL(start.m, stop.m) as m, ISNULL(numstarts, 0), ISNULL(numstops, 0)
FROM (SELECT MONTH(start_date) as m, COUNT(*) as numstarts
      FROM customer c
      GROUP BY MONTH(start_date)
     ) start FULL OUTER JOIN
     (SELECT MONTH(stop_date) as m, COUNT(*) as numstops
      FROM customer c
      GROUP BY MONTH(stop_date)
     ) stop
     ON start.m = stop.m

Another formulation might use an aggregation and UNION:

SELECT m, SUM(isstart), SUM(isstop)
FROM ((SELECT MONTH(start_date) as m, 1 as isstart, 0 as isstop
       FROM customer c)
      UNION ALL
      (SELECT MONTH(stop_date) as m, 0 as isstart, 1 as isstop
       FROM customer c)) a
GROUP BY m

Now, in both of these, there are two pieces of processing, one for the starts and one for the stops. In almost any database optimizer that I can think of, both these queries (and other, similar queries) require two passes through the data, one pass for the starts and one pass for the stops. And, regardless of the optimizer, the SQL statements describe two passes through the data.

The MapReduce framework has a more efficient, and perhaps even more intuitive, solution (sketched in the code after this list). The map phase can produce two output records for each input record:
  • The first has a key that is MONTH(start_date) and the value is a structure containing isstart and isstop with values of 1 and 0 respectively.
  • The second has a key that is MONTH(stop_date), and its value has isstart and isstop set to 0 and 1, respectively.
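Here is a rough Hadoop sketch of that single pass, under the same assumed comma-separated layout as before (start date in the second column, stop date in the third); encoding the value as a tab-separated pair is just one convenient choice.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StartsAndStopsByMonth {

    // Map: one pass over the data, two (key, value) pairs per customer record.
    public static class StartStopMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final IntWritable month = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", -1);
            String startDate = fields[1];                        // assumed yyyy-MM-dd
            String stopDate = fields[2];                         // may be empty if still active

            month.set(Integer.parseInt(startDate.substring(5, 7)));
            context.write(month, new Text("1\t0"));              // isstart=1, isstop=0

            if (!stopDate.isEmpty()) {
                month.set(Integer.parseInt(stopDate.substring(5, 7)));
                context.write(month, new Text("0\t1"));          // isstart=0, isstop=1
            }
        }
    }

    // Reduce: sum the starts and stops for each month.
    public static class StartStopReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable month, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long starts = 0;
            long stops = 0;
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                starts += Long.parseLong(parts[0]);
                stops += Long.parseLong(parts[1]);
            }
            context.write(month, new Text(starts + "\t" + stops));
        }
    }
}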
What is important about this example is not the details, simply the fact that the processing is quite different. The SQL methods describe two passes through the data. The MapReduce method has only one pass through the data. In short, MapReduce can be more efficient than SQL aggregations.

How much better this can be becomes apparent when we look in more detail at what is happening. When processing data, SQL limits us to one group key per record for aggregation. MapReduce does not have this limitation: we can emit as many keys as we like for each record.

This difference is particularly important when analyzing complex data structures to extract features -- of which processing text from web pages is an obvious example. To take another example, one way of doing content filtering of email for spam involves looking for suspicious words in the text and then building a scoring function based on those words (naive Bayesian models would be a typical approach).

Attempting to do this in MapReduce is quite simple. The map phase looks for each suspicious word and spits out a key-value pair for that word (in this case, the key is actually the email id combined with the word). The reduce phase counts them all up. Either the reduce phase or a separate program can then apply the particular scoring code. Extending such a program to include more suspicious words is pretty simple.
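A sketch of that mapper, assuming each input line holds an email id and the email text separated by a tab, and with the suspicious-word list hard-coded purely for illustration:

import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SuspiciousWordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Hypothetical word list; a real job might load this from configuration instead.
    private static final Set<String> SUSPICIOUS =
            Set.of("winner", "free", "lottery", "urgent");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);    // emailId <tab> body
        String emailId = parts[0];
        for (String word : parts[1].toLowerCase().split("\\W+")) {
            if (SUSPICIOUS.contains(word)) {
                outKey.set(emailId + ":" + word);             // key = email id combined with word
                context.write(outKey, ONE);
            }
        }
    }
}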

Attempting to do this in SQL . . . well, I wouldn't do it in the base language. It would be an obfuscated mess of CASE statements or a non-equi join on a separate word table. The MapReduce approach is simpler and more elegant in this case. Ironically, I usually describe the problem as "SQL is not good at handling text." However, I think the real problem is that "SQL aggregations are limited to one key per record."

SQL has many capabilities that MapReduce does not have. Over time, MapReduce may incorporate many of the positive features of SQL for analytic capabilities (and hopefully not incorporate unneeded overhead for transactions and the like). Today, SQL remains superior for most analytic needs -- and it can take advantage of parallelism without programming. I hope SQL will also evolve, bringing together the enhanced functionality from other approaches.

--gordon
