
Wednesday, November 18, 2009

Getting Started with Hadoop and MapReduce

Over the past week, I have been teaching myself Hadoop -- an open-source implementation of MapReduce, the parallel programming paradigm made popular by Google (which invented MapReduce), Yahoo! (which created the open source version called Hadoop), and Amazon (whose EC2 service provides cloud computing resources for running the software).

My purpose is simple: one of our clients has a lot of web log data on Amazon S3 (which provides lots of cheap data storage). This data can be analyzed on EC2, using Hadoop. Of course, the people who put the data up there are busy running the web site. They are not analysts/data miners. Nor do they have much bandwidth to help a newbie get started. So, this has been a do-it-yourself effort.

I have some advice for anyone else who might be in a similar position. Over the past few days, I have managed to turn my Windows Vista laptop into a little Hadoop machine, where I can write code in an IDE (an "integrated development environment", which is what you program in), run it on a one-node Hadoop cluster (that is, my laptop), and actually get results.

There are several ways to get started with Hadoop. Perhaps you work at a company that has Hadoop running on one or more clusters or has a cluster in the cloud. You can just talk to people where you are and get started.

You can also endeavor to install Hadoop yourself. There is a good chance that I will attempt this one day. However, it is supposed to be a rather complicated process. And, all the configuration and installation is a long way from data.

My preferred method is to use a Hadoop virtual machine. And I found a nice one through the Yahoo Hadoop Tutorial (and there are others . . . I would like to find one running the most recent version of Hadoop).

I have some advice, corrections and expectations for anyone else who wants to try this.

(1) Programming languages.

Hopefully you already know some programming languages, preferably of the object-oriented sort. If you are a Java programmer, kudos to you, since Hadoop is written in Java. However, I found that I can struggle through Java with my knowledge of C++ and C#. (This post is *not* about my complaints about Java.)

For the purposes of this discussion, I do not consider SAS or SPSS to be worthy programming languages. You should be familiar with ideas such as classes, constructors, static and instance variables, functions, and class inheritance. You should also be willing to read through long run-time error messages, whose ultimate meaning is something like "I was expecting a constructor with no arguments" or "I received a Text type when I was expecting an IntWritable."


(2) Unix Familiarity

Just as Hadoop was written in Java, it was developed in a Unix environment. This means that you should be familiar with Unix shell commands (you know, "ls" versus "dir", "rm" versus "del", and so on). There is a rumor that a version of Hadoop will run under Windows. I am sure, though, that trying to install open source Unix-based software under Windows would be a major effort; so I chose the virtual machine route.

By the way, if you want Unix shell commands on your Windows box, use Cygwin! It provides the common Unix commands. Just download the most recent version from www.cygwin.com.


(3) Use Yahoo!'s Hadoop Tutorial (which is here)

This has proven invaluable in my efforts, although I have a few corrections and clarifications, which I describe below.


(4) Be Prepared to Load Software! (And have lots of disk space)

I have loaded the following software packages on my computer, to make this work:
  • VMware Player 3.0 (from VMware). This is free software that lets one operating system run as a virtual machine inside another.
  • Hadoop VM Appliance (from Yahoo). This contains the virtual machine image with Hadoop preinstalled.
  • Eclipse SDK, version 3.3.2 (from the Eclipse project archives). Note that the Tutorial suggests version 3.3.1; version 3.3.2 works. Anything more recent is called "Ganymede", and it just doesn't work with Hadoop.
  • Java Development Kit from Sun.
These are all multi-hundred megabyte zipped files.


(5) Getting VMWare to Share Folders between the host and virtual machine

In order to share folders (the easiest way to pass data back and forth), you need to install VMWare Tools. VMWare has instructions here. However, it took me a while to figure out what to do. The real steps are:

(a) Under VM-->Settings go to the Hardware tab. Set the CD/DVD to be "connected" and use the ISO image file. My path is "C:\Program Files (x86)\VMware\VMware Player\linux.iso", but I believe that it appeared automatically, after a pause. This makes the virtual machine think it is reading from a CD when it is really reading from this file.

(b) Run the commands to mount and extract files
Log in or su to root and then run the following:
mount /cdrom
cd /tmp
tar zxf /cdrom/vmware-freebsd-tools.tar.gz
umount /cdrom

(c) Do the installation
I accepted the defaults for all of the questions.

cd vmware-tools-distrib
./vmware-install.pl
vmware-config-tools.pl

(d) Set up a shared folder

Go to VM --> Settings... and go to the Options tab. Enable the shared folders. Choose an appropriate folder on the host machine (I chose "c:\temp") and give it a name on the virtual machine ("temp").

(e) The folder is available on the virtual machine at /mnt/hgfs/temp.
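
A quick sanity check from inside the virtual machine (the file names here are just placeholders; anything you drop into c:\temp on the host should show up in /mnt/hgfs/temp, and vice versa):

ls /mnt/hgfs/temp
cp /mnt/hgfs/temp/somefile.txt /tmp
cp /tmp/results.txt /mnt/hgfs/temp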


(6) Finding the Hadoop Plug-In for Eclipse

The third part of the Tutorial has the curious instruction: "In the hadoop-0.18.0/contrib/eclipse-plugin directory on this CD, you will find a file named hadoop-0.18.0-eclipse-plugin.jar. Copy this into the plugins/ subdirectory of wherever you unzipped Eclipse." This is curious because there is no CD.

You will find this directory on the virtual machine. Go to it and copy the jar file to the shared folder. Then, on the host machine, copy it from the shared folder to the described place.


(7) The Tutorial's code examples may not work

I believe the following lines in the Tutorial's driver code are not correct:
   FileInputPath.addInputPath(conf, new Path("input"));
FileOutputPath.addOutputPath(conf, new Path("output"));

The following works:
   conf.setInputPath(new Path("input"));
conf.setOutputPath(new Path("output"));
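
For what it's worth, the documented form in the 0.18-era API goes through the FileInputFormat and FileOutputFormat helper classes. Here is a minimal driver along those lines. This is a sketch, not something I have tested against every release; WordCountMapper and WordCountReducer are placeholders for whatever mapper and reducer classes you have written:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Types of the key-value pairs the job emits.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Placeholders for your own mapper and reducer classes.
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        // Input and output locations in HDFS.
        FileInputFormat.setInputPaths(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        JobClient.runJob(conf); // submit the job and wait for it to finish
    }
}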

(8) Thinking the Hadoop Way

Hadoop is a different style of programming, because it is distributed. Actually, there are three different "machines" that it uses.

The first is your host machine, which is where you develop code. Although you can develop software on the Hadoop machine, the tools are better on your desktop.

The second is the Hadoop machine, which is where you can issue the commands related to Hadoop. In particular, the command "hadoop" provides access to the parallel data. This machine has a shared drive with the host machine, which you can use to move files back and forth.

The data used by the programs, though, is in a different place: the Hadoop Distributed File System (HDFS). To move data between HDFS and the virtual machine, use the "hadoop fs" command. From there, the shared folder lets you move files to the host machine or anywhere else.
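
For example, a round trip might look like the following (the paths assume the shared folder from step (5), and mydata.txt is just a placeholder name):

hadoop fs -mkdir input
hadoop fs -put /mnt/hgfs/temp/mydata.txt input
hadoop fs -ls input
hadoop fs -get output/part-00000 /mnt/hgfs/temp/results.txt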

--gordon

Wednesday, April 8, 2009

MapReduce, Hadoop, Everything Old Is New Again

One of the pleasures of aging is watching younger generations discover pleasures one remembers discovering long ago--sex, porcini, the Beatles. Occasionally though, it is frustrating to see old ideas rediscovered as new ones. I am especially prone to that sort of frustration when the new idea is one I once championed unsuccessfully. Recently, I've been feeling as though I was always a Beatles fan but until recently all my friends preferred Herman's Hermits. Of course, I'm glad to see them coming around to my point of view, but still . . .

What brings these feelings up is all the excitement around MapReduce. It's nice to see a parallel programming paradigm that separates the description of the mapping from the description of the function to be applied, but at the same time, it seems a bit underwhelming.

You see, I literally grew up with the parallel programming language APL. In the late 60's and early 70's my father worked at IBM's Yorktown Heights research center in the group that developed APL, and I learned to program in that language at the age of 12. In 1982 I went to Analogic Corporation to work on an array processor implementation of APL. In 1986, while still at Analogic, I read Danny Hillis's book The Connection Machine and realized that he had designed the real APL Machine. I decided I wanted to work at the company that was building Danny's machine. I was hired by Guy Steele, who was then in charge of the software group at Thinking Machines. In the interview, all we talked about was APL.

The more I learned about the Connection Machine's SIMD architecture, the more perfect a fit it seemed for APL or an APL-like language in which hypercubes of data may be partitioned into subcubes of any rank so that arbitrary functions can be applied to them. In APL and its descendants such as J, reduction is just one of a rich family of ways that the results of applying a function to various data partitions can be glued together to form a result. I described this approach to parallel programming in a paper published in ACM SIGPLAN Notices in 1990, but as far as I know, no one ever read it. (You can, though. It is available here.)

My dream of implementing APL on the Connection Machine gradually faded in the face of commercial reality. The early Connection Machine customers, having already been forced to learn Lisp, were not exactly clamouring for another esoteric language; they wanted Fortran. And Fortran is what I ended up working on. As you can tell, I still have regrets. If we'd implemented a true parallel APL back then, no one would have to invent MapReduce today.

Sunday, December 7, 2008

MapReduce and SQL Aggregations Using Grouping Sets

In an earlier post, I compared MapReduce functionality and SQL functionality and made the claim that SQL requires two passes through the data to calculate the number of customer starts and stops per month. (The data used for this is on the companion web site for my book Data Analysis Using SQL and Excel.)

Two of the comments on that post explained SQL syntax that achieves the goal more efficiently. In particular, the GROUPING SETS keyword, which is part of more recent SQL standards, is an efficient solution that allows SQL to do more of the types of processing made possible by MapReduce. This functionality is available in SQL Server and Oracle. However, it is not yet available in MySQL.

The following SQL query answers the question -- how many customers start and how many stop in each month -- using a FULL OUTER JOIN (an alternative approach is to use UNION ALL):

SELECT COALESCE(start.m, stop.m) as m, ISNULL(numstarts, 0), ISNULL(numstops, 0)
FROM (SELECT MONTH(start_date) as m, COUNT(*) as numstarts
......FROM customer c
......GROUP BY MONTH(start_date)
.....) start FULL OUTER JOIN
.....(SELECT MONTH(stop_date) as m, COUNT(*) as numstops
......FROM customer c
......GROUP BY MONTH(stop_date)
.....) stop
.....ON start.m = stop.m


Although this query is effective, in most databases, it would require two passes through the data.

An alternative approach is to use GROUPING SETS. This keyword is a generalization of and improvement on the CUBE functionality. The generalization is more powerful because it gives more options to the query optimizer.

GROUPING SETS allows a query to return summaries along each grouping dimension, with or without generating the full set of rows that GROUP BY would create. The following query returns a separate row for each combination of start month and stop month:

SELECT MONTH(start_date) as start_month, MONTH(stop_date) as stop_month,
.......COUNT(*) as cnt
FROM customer
GROUP BY MONTH(start_date), MONTH(stop_date)

We can imagine each row in the result set as a cell in a big cross-tabulation table, with the start months on the rows and the stop months on the columns. What we really want are the subtotals along the rows and the columns, not the full table. The following query accomplishes this:

SELECT COALESCE(start_month, stop_month) as month,
.......SUM(CASE WHEN stop_month IS NULL THEN cnt ELSE 0 END) as starts,
.......SUM(CASE WHEN start_month IS NULL THEN cnt ELSE 0 END) as stops
FROM (SELECT MONTH(start_date) as start_month, MONTH(stop_date) as stop_month,
.............COUNT(*) as cnt
......FROM customer
......GROUP BY GROUPING SETS (MONTH(start_date), MONTH(stop_date))
.....) a
WHERE start_month IS NOT NULL OR stop_month IS NOT NULL

The subquery aggregates the data in a special way. The outer query simply reformats the results to look like the earlier query's; its WHERE clause (note the OR) discards only the rows where both months are NULL, which arise from customers with a NULL stop date.

The GROUPING SETS keyword specifies that summaries of the data should be returned, rather than the individual aggregated rows. This syntax specifies that groups are created for the start month and stop month. So, the inner query returns rows such as the following:


Start Month   Stop Month   Count
Jan           NULL         ...
Feb           NULL         ...
. . .
NULL          Jan          ...
NULL          Feb          ...
. . .

However, the cross-tabulation rows generated by the regular GROUP BY are not there. Half the rows have the subtotals for the start months; for these, the stop month column is NULL. The other half have the subtotals for the stop months; for these, the start month column is NULL. This syntax does not generate the cross-tabulation data, but it does keep the row and column subtotals.
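
One subtlety: a customer who has not stopped has a NULL stop_date, so MONTH(stop_date) can be NULL in the data itself, which looks just like a subtotal NULL. SQL Server and Oracle offer the GROUPING() function to tell the two apart; it returns 1 when a column is NULL because the row is a subtotal over that column. A sketch of the idea:

SELECT MONTH(start_date) as start_month, MONTH(stop_date) as stop_month,
.......GROUPING(MONTH(stop_date)) as is_start_subtotal,
.......COUNT(*) as cnt
FROM customer
GROUP BY GROUPING SETS (MONTH(start_date), MONTH(stop_date))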

The GROUPING SETS keyword generates the subtotals for the start months and stop months. In general, the query optimizer will generate the various grouping aggregations in one pass over the data. This makes the syntax and performance much more similar to the MapReduce approach.

However, MapReduce still has two practical advantages. The first is the limit on the number of groups in the grouping sets. In SQL Server, only 32 groups are allowed. This example used only two, but more complex examples might breach that limit.

The other issue is the flexibility of the SQL language. One of the major uses of MapReduce is to process text. In this case, we would be extracting many potential features from the text and then doing subsequent aggregations. SQL extensions can be used to create the features. However, such features quickly exceed the limit on the number of groups, limiting the feasibility of this approach.

One warning about the syntax: the parentheses in the GROUPING SETS statement are important. The following version is actually equivalent to the regular GROUP BY:

GROUP BY GROUPING SETS ((MONTH(start_date), MONTH(stop_date)))

This is because the keyword takes a list of things being grouped. In the original version (one set of parentheses), there are two elements in the list -- the totals for start month and for stop month. In the second version, there is one element in the list, so the cross-tabulation between start month and stop month is generated instead. This cross-product is the equivalent of the regular GROUP BY.
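
Side by side, with the difference annotated:

GROUP BY GROUPING SETS (MONTH(start_date), MONTH(stop_date))   -- two sets: row and column subtotals only
GROUP BY GROUPING SETS ((MONTH(start_date), MONTH(stop_date))) -- one composite set: same as the regular GROUP BY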

And my final comment is about the CUBE keyword which seems to provide the same functionality. This keyword generates all the regular aggregation rows in the table, along with additional subtotals for all combinations of dimensions. In the above example, it would generate the cross-tab table of start month and end month, as well as the summary rows.

The problem with the CUBE keyword is that the original query does not need all the aggregation rows, so generating them is a waste of time. Whether this is faster or slower than the original version of the query with the FULL OUTER JOIN depends on the environment. However, it could be quite inefficient. In addition, the query optimizer would have a very difficult time determining that these rows are not needed.

I do not feel that the CUBE keyword provides functionality similar to MapReduce. However, the GROUPING SETS keyword does, because it produces summaries along dimensions without requiring multiple passes through the data and without generating large cross-tabulations. In addition, the GROUPING SETS keyword allows the query optimizer to choose from a variety of algorithms for executing the query, taking advantage of large computer systems using SQL syntax instead of programming.

Tuesday, August 26, 2008

MapReduce Functionality in Commercial Databases

If you use LinkedIn, then you have probably been impressed by their "People you may know" feature. I know that I have. From old friends and colleagues to an occasional person I don't necessarily want to see again, the list often contains quite familiar names.

LinkedIn is basically a large graph of connections among people, enhanced with information such as company names, dates of links, and so on. We can imagine how they determine whether someone belongs in someone's "People you may know" category: by using common names, common companies, and even common paths (people who know each other).

However, trying to imagine how they might determine this information using SQL is more challenging. SQL provides the ability to store a graph of connections. However, traversing the graph is rather complicated in standard SQL. Furthermore, much of the information that LinkedIn maintains is complicated data -- names of companies, job titles, and dates, for instance.
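
To see why, consider the simplest version of the feature: people two hops away, ranked by the number of shared connections. Assuming a hypothetical table connection(person_id, friend_id), even this requires a self-join plus a check that the candidate is not already a direct connection:

SELECT c1.person_id, c2.friend_id as candidate, COUNT(*) as num_shared
FROM connection c1 JOIN
.....connection c2
.....ON c1.friend_id = c2.person_id
WHERE c1.person_id <> c2.friend_id AND
......NOT EXISTS (SELECT * FROM connection c3
..................WHERE c3.person_id = c1.person_id AND
........................c3.friend_id = c2.friend_id)
GROUP BY c1.person_id, c2.friend_id

Each additional hop means another self-join, and bringing in companies, titles, and dates means joining in still more tables.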

It is not surprising, then, that they are using MapReduce to develop this information. The surprise, though, is that their data is being stored in a relational database, which provides full transactional-integrity and SQL querying capabilities. The commercial database software that supports both is provided by a company called Greenplum.

Greenplum has distinguished itself from other next-generation database vendors by incorporating MapReduce into its database engine. Basically, Greenplum developed a parallel framework for managing data, ported Postgres into this framework, and has now ported MapReduce as well. This is a strong distinction from other database vendors that provide parallel Postgres solutions, and it is particularly well suited to the complex datatypes encountered on the web.

I do want to point out that the integration of MapReduce is at the programming level. In other words, they have not changed SQL; they have added a programming layer that allows data in the database to be readily accessed using MapReduce primitives.

As I've discussed in other posts, MapReduce and SQL are complementary technologies, each with their own strengths. MapReduce can definitely benefit from SQL functionality, since SQL has proven its ability for data storage and access. On the other hand, MapReduce has functionality that is not present in SQL databases.

Now that a database vendor has fully incorporated MapReduce into its database engine, we need to ask: Should MapReduce remain a programming paradigm or should it be incorporated into the SQL query language? What additional keywords and operators and so on are needed to enhance SQL functionality to include MapReduce?

Tuesday, April 8, 2008

Databases, MapReduce, and Disks

I just came across an interesting blog posting by Tom White entitled "Disks Have Become Tapes". The posting makes the following claim: relational databases are limited by the seek speed of disks, whereas MapReduce-based methods take advantage of the streaming capabilities of disks. Hence, MapReduce is better than an RDBMS for various types of processing.

Once again, I read a comment in a blog that seems misguided and gives inaccurate information. My guess is that people learn relational databases from the update/insert perspective and don't understand complex query processing. Alas. I do recommend my book Data Analysis Using SQL and Excel for such folks. Relational databases can take advantage of high-throughput disks.

Of course, the problem is not new. Tom White quotes David DeWitt quoting Jim Gray saying "Disks are the new tapes" (here). And the numbers are impressive. It takes longer to read a high capacity disk now than it did twenty years ago, because capacity has increased much faster than transfer rates. As for random seeks on the disk, let's not go there. Seek times have hardly improved at all over this time period. Seeking on a disk is like going to Australia in a canoe -- the canoe works well enough to cross a river, so why not an ocean? And, as we all know, RDBMSs use a lot of seeks for queries so they cannot take advantage of modern disks. MapReduce to the rescue!

Wait, is that common wisdom really true?

It is true that for updating or fetching a single row, an RDBMS does use disk seeks to get there (especially if there is an index). However, this is much faster than the alternative of streaming through the whole table -- even on a fancy MapReduce system with many cheap processors connected to zillions of inexpensive disks.

On a complex query, the situation is a bit more favorable to the RDBMS for several reasons. First, large analytic queries typically read entire tables (or partitions of tables). They do not "take advantage" of indexing, since they read all rows using full table scans.

However, database engines do not read rows. They read pages. Between the query processor and the data is the page manager. Or, as T. S. Eliot wrote in his poem "The Hollow Men" [on an entirely different topic]:

Between the idea
And the reality
Between the motion
And the act
Falls the shadow

In this case, the shadow is the page manager, a very important but often overlooked component of a database management system.

Table scans read the pages assigned to a table. So, query performance is based on a balance of disk performance (both throughput and latency) and page size. For a database used for analytics, use a big page size. 4k is way too small . . . 128k or even 1 Mbyte could be very reasonable (and I have seen systems with even larger page sizes). Also, remember to stuff the pages full. There is no reason to partially fill pages unless the table has updates (which are superfluous for most data warehouse tables).
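
Some illustrative arithmetic, using round numbers that are assumptions rather than measurements: suppose a seek plus rotational delay costs 10 milliseconds and the disk streams 100 megabytes per second. Then:

reading a 4k page: 10 ms seek + ~0.04 ms transfer --> over 99% of the time is overhead
reading a 1 Mbyte page: 10 ms seek + 10 ms transfer --> half the time is moving data

Bigger pages amortize the latency over more data, which is exactly what a full table scan wants.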

Databases do a lot of things to improve performance. Probably the most important boost is accidental. Large database tables are typically loaded in bulk, say once-per-day. As a result, the pages are quite likely to be allocated sequentially. Voila! In such cases, the seek time from one page to the next is minimal.

But, databases are smarter than that. The second boost is pre-fetching pages that are likely to be needed. Even a not-so-smart database engine can realize when it is doing a full table scan. The page manager can seek to the next page at the same time that the processor is processing data in memory. That is, the CPU is working, while the page manager spends its time waiting for new pages to load. Although the page manager is waiting, the CPU is quite busy processing other data, so there is no effective wait time.

This overlap between CPU cycles and disk is very important for database performance on large queries. And you can see it on a database machine. In a well-balanced system, the CPUs are often quite busy on a large query and the disks are less busy.

Modern RDBMSs have a third capability with respect to complex queries. Much of the work is likely to take place in temporary tables. The page manager will often store these on sequential pages, optimized for sequential access. In addition, temporary tables store only the columns that they need.

In short, databases optimize their disk access in several ways. They take advantage of high-throughput disks by:
  • using large page sizes to reduce the impact of latency;
  • storing large databases on sequential pages;
  • prefetching pages while the processor works on data already in memory;
  • efficiently storing temporary tables.
At least they are doing something! By the way, the balance between latency and throughput goes back at least to the 1980s when I entered this business. And I suspect that it is a much older concern.

The advantage and disadvantage of the MapReduce approach is that it leaves such optimizations in the hands of the operating system and the programmer. Fortunately, modern computer languages are smart with respect to sequential file I/O, so reading some records and then processing them would normally be optimized.

Of course, a programmer can disrupt this by writing temporary or output files to the same disk system being used to read data. Well, actually, disks are also getting smarter. With multiple platters and multiple read heads, modern disks can support multiple seeks to different areas.

A bigger problem arises with complex algorithms. MapReduce does not provide built-in support for joining large tables. Nor even for joining smaller tables. A nested loop join in MapReduce code could kill the performance of a query. An RDBMS might implement the same join using hash tables that gracefully overflow memory, should that be necessary. An exciting development in a programmer's life is when a hash table in memory gets too big and he or she learns about operating system page faults, a concern that the database engine takes care of by itself.
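
For the curious, the standard workaround is the "reduce-side join": the map phase tags each record with the table it came from and emits the join key as the key; the reducer then sees all the records for one key together and pairs them up. Here is a heavily simplified sketch of such a reducer against the 0.18-era API, assuming the mappers have prefixed each value with "A|" or "B|" to mark its source table:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Buffer each side in memory. If one side is large, this is exactly
        // where a hand-rolled join outgrows memory -- the problem a database
        // engine's hash join handles gracefully on its own.
        List<String> sideA = new ArrayList<String>();
        List<String> sideB = new ArrayList<String>();
        while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("A|")) sideA.add(v.substring(2));
            else sideB.add(v.substring(2));
        }
        // Emit the cross-product of the two sides for this join key.
        for (String a : sideA)
            for (String b : sideB)
                out.collect(key, new Text(a + "\t" + b));
    }
}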

As I've mentioned before, RDBMS versus MapReduce is almost a religious battle. MapReduce has capabilities that RDBMSs do not have, and not only because programming languages are more expressive than SQL. The paradigm is strong and capable for certain tasks.

On the other hand, SQL is a comparatively easy language to learn (compared, I mean, to programming for MapReduce), and relational database engines often have decades of experience built into them -- for partitioning data, choosing join and aggregation algorithms, building temporary tables, keeping processors busy and disks spinning, and so on. In particular, RDBMSs do know a trick or two to optimize disk performance and take advantage of modern high-latency, high-throughput disks.

--gordon

Friday, January 25, 2008

MapReduce and SQL Aggregations

This is another post discussing the article on MapReduce written by Professors Michael Stonebraker and David DeWitt (available here).

One of the claims that they make is:

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

This claim is worth discussing in more detail because it is a very powerful and intuitive analogy. And, it is simply incorrect. MapReduce is much more powerful than SQL aggregations.

Another reason why I find their claim interesting is because I use the same analogy to describe MapReduce to people familiar with databases. Let me explain this in a bit more detail. Consider the following SQL query to count the number of customers who start in each month:

SELECT MONTH(c.start_date), COUNT(*)
FROM customer c
GROUP BY MONTH(c.start_date)

Now, let's consider how MapReduce would "implement" this query. Of course, MapReduce is a programming framework, not a query interpreter, but we can still use it to solve the same problem. The MapReduce framework solves this problem in the following way:

First, the map phase would read records from the customer table and produce an output record with two parts. The first part, called the "key", is populated with MONTH(c.start_date). The second part, the "value", can be arbitrarily complicated. In this case, it is as simple as it gets: the value simply contains "1".

The reduce phase then reads the key-value pairs, and aggregates them. The MapReduce framework sorts the data so records with the same key always occur together. This makes it easy for the reduce phase to add together all the "1"s to get the count for each key (which is the month number). The result is a count for each key.

I have intentionally oversimplified this process by describing it at a high level. The first simplification is leaving out all the C++ or Java overhead for producing the programs (although there are attempts at interpreted languages to greatly simplify this process). Another is not describing the parallel processing aspects. And yet another is leaving out the "combine" step: the above algorithm can be made more efficient by first "reducing" the values locally on each processor to get subtotals, and then "reducing" these subtotals again. This post, however, is not about computational efficiency.
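
To make the correspondence concrete, here is roughly what the mapper and reducer might look like in the 0.18-era Java API. This is a sketch under simplifying assumptions: each input line is taken to hold a start date in ISO format (such as 2009-11-18), and the two classes would normally live in their own files or be nested inside the driver.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

class StartMonthMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<IntWritable, IntWritable> out,
                    Reporter reporter) throws IOException {
        // The key is MONTH(start_date); the value is simply "1".
        int month = Integer.parseInt(line.toString().substring(5, 7));
        out.collect(new IntWritable(month), ONE);
    }
}

class SumReducer extends MapReduceBase
        implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    public void reduce(IntWritable month, Iterator<IntWritable> values,
                       OutputCollector<IntWritable, IntWritable> out,
                       Reporter reporter) throws IOException {
        // Add up the "1"s for each month to get the count.
        int count = 0;
        while (values.hasNext()) count += values.next().get();
        out.collect(month, new IntWritable(count));
    }
}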

The important thing to note is the following three correspondences between MapReduce and SQL aggregates.
  1. First, MapReduce uses a "key". This key is the GROUP BY expression in a SQL aggregation statement.
  2. Second, MapReduce has a "map" function. This is the expression inside the aggregation function's parentheses. It can be an arbitrary function or CASE statement in SQL. In databases that support user defined functions, it can be arbitrarily complicated, as with the "map" function in MapReduce.
  3. Third, MapReduce has a "reduce" function. This is the aggregation function itself. In SQL, this is traditionally one of a handful of functions (SUM(), AVG(), MIN(), MAX()). However, some databases support user defined aggregation functions, which can be arbitrarily complicated.
So, it seems that SQL and MapReduce are equivalent, particularly in an environment where SQL supports user defined functions written in an external language (such as C++, Java, or C-Sharp).

Wrong!

The next example extends the previous one by asking how many customers start and how many stop in each month. There are several ways of approaching this. The following shows one approach using a FULL OUTER JOIN:

SELECT COALESCE(start.m, stop.m) as m, ISNULL(numstarts, 0), ISNULL(numstops, 0)
FROM (SELECT MONTH(start_date) as m, COUNT(*) as numstarts
......FROM customer c
......GROUP BY MONTH(start_date)
.....) start FULL OUTER JOIN
.....(SELECT MONTH(stop_date) as m, COUNT(*) as numstops
......FROM customer c
......GROUP BY MONTH(stop_date)
.....) stop
.....ON start.m = stop.m

Another formulation might use an aggregation and UNION:

SELECT m, SUM(isstart), SUM(isstop)
FROM ((SELECT MONTH(start_date) as m, 1 as isstart, 0 as isstop
.......FROM customer c)
......UNION ALL
........(SELECT MONTH(stop_date) as m, 0 as isstart, 1 as isstop
........FROM customer c)) a
GROUP BY m

Now, in both of these, there are two pieces of processing, one for the starts and one for the stops. In almost any database optimizer that I can think of, both these queries (and other, similar queries) require two passes through the data, one pass for the starts and one pass for the stops. And, regardless of the optimizer, the SQL statements themselves describe two passes through the data.

The MapReduce framework has a more efficient, and perhaps, even more intuitive solution. The map phase can produce two output keys for each record:
  • The first has a key that is MONTH(start_date) and the value is a structure containing isstart and isstop with values of 1 and 0 respectively.
  • The second has a key that is MONTH(stop_date), and the values are 0 and 1 respectively (see the sketch below).
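A sketch of such a map function, again in the 0.18-era API. This makes two assumptions for brevity: each input line holds a start date and a stop date separated by a tab, and the two counts are packed into a Text value (a custom Writable would be cleaner):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class StartStopMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<IntWritable, Text> out,
                    Reporter reporter) throws IOException {
        // Assume each line looks like "2008-01-25<tab>2008-06-30".
        String[] fields = line.toString().split("\t");
        int startMonth = Integer.parseInt(fields[0].substring(5, 7));
        int stopMonth = Integer.parseInt(fields[1].substring(5, 7));
        out.collect(new IntWritable(startMonth), new Text("1\t0")); // one start
        out.collect(new IntWritable(stopMonth), new Text("0\t1"));  // one stop
    }
}

The reducer then sums the two columns separately, producing the starts and the stops for every month in a single pass over the data.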
What is important about this example is not the details, simply the fact that the processing is quite different. The SQL methods describe two passes through the data. The MapReduce method has only one pass through the data. In short, MapReduce can be more efficient than SQL aggregations.

How much better becomes apparent when we look in more detail at what is happening. When processing data, SQL limits us to one key per record for aggregation. MapReduce does not have this limitation. We can have as many keys as we like for each record.

This difference is particularly important when analyzing complex data structures to extract features -- of which processing text from web pages is an obvious example. To take another example, one way of doing content filtering of email for spam involves looking for suspicious words in the text and then building a scoring function based on those words (naive Bayesian models would be a typical approach).

Attempting to do this in MapReduce is quite simple. The map phase looks for each suspicious word and spits out a key-value pair for it (in this case, the key is actually the email id combined with the word). The reduce phase counts them all up. Either the reduce phase or a separate program can then apply the particular scoring code. Extending such a program to include more suspicious words is pretty simple.

Attempting to do this in SQL . . . well, I wouldn't do it in the base language. It would be an obfuscated mess of CASE statements or a non-equi JOIN on a separate word table. The MapReduce approach is simpler and more elegant in this case. Ironically, I usually describe the problem as "SQL is not good in handling text." However, I think the real problem is that "SQL aggregations are limited to one key per record."

SQL has many capabilities that MapReduce does not have. Over time, MapReduce may incorporate many of the positive features of SQL for analytic capabilities (and hopefully not incorporate unneeded overhead for transactions and the like). Today, SQL remains superior for most analytic needs -- and it can take advantage of parallelism without programming. I hope SQL will also evolve, bringing together the enhanced functionality from other approaches.

--gordon