Tuesday, September 30, 2008

A question about decision trees

Hi,
In your experience with decision trees, do you prefer to use a small set of core variables in order to make the model more elegant and/or understandable? At what point do you feel a tree has grown too large and complicated? What are the indicators that typically tell you that you need to do some pruning?
Thank you!
-Adam


Elegance and ease of understanding may or may not be important depending on your model's intended purpose. There are certainly times when it is important to come up with a small set of simple rules. In our book Mastering Data Mining we give an example of a decision tree model used to produce rules that were printed on a poster next to a printing press so the press operators could avoid a particular printing defect. When a decision tree is used for customer segmentation, it is unlikely that your marketing department is equipped to handle more than a handful of segments, and the segments should be described in terms of a few familiar variables. In both of these cases, the decision tree is meant to be descriptive.

On the other hand, many (I would guess most) decision trees are not intended as descriptions; they are intended to produce scores of some kind. If the point of the model is to give each prospect a probability of response, then I see no reason to be concerned about having hundreds or even thousands of leaves so long as each one receives sufficient training records that the proportion of responders at the leaf is a statistically confident estimate of the response probability. A very nice feature of decision tree models is that one need not grok the entire tree in order to interpret any particular rule it generates. Even in a very complex tree, the path from the root to a particular leaf of interest gives a fairly simple description of records contained in that leaf.

For trees used to estimate some continuous quantity, an abundance of leaves is very desirable. As estimators, regression trees have the desirable quality of never making truly unreasonable estimates (as a linear regression, for example, might do) because every estimate is an average of a large number of actual observed values. The downside is that it cannot produce any more distinct values than it has leaves. So, the more leaves the better.

The need for pruning usually arises when leaves are allowed to become too small. This leads to splits that are not statistically significant. Apply each split rule to your training set and to a validation set drawn from the same population. You should see the same distribution of target classes in both the training and validation data. If you do not, your model has overfit the training data. Many software tools have absurdly low default minimum leaf sizes--probably because they were developed on toy datasets such as the ubiquitous irises. I routinely set the minimum leaf size to something like 500 so overfitting is not an issue and pruning is unnecessary.
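
If you want to see what that looks like in practice, here is a rough sketch (scikit-learn and pandas, with a hypothetical file name and column names, and assuming the inputs are already numeric) of growing a tree with a large minimum leaf size and then checking each leaf's response rate on training and validation data:

    # Sketch: grow a tree with large leaves, then check that each leaf's
    # response rate is similar on training and validation data.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("prospects.csv")          # hypothetical file; 'responded' is a 0/1 target
    X = df.drop(columns="responded")
    y = df["responded"]
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

    tree = DecisionTreeClassifier(min_samples_leaf=500)   # large leaves, so pruning is not needed
    tree.fit(X_train, y_train)

    def leaf_response_rates(X, y):
        leaves = tree.apply(X)                 # the leaf each record falls into
        return pd.Series(y.values, index=leaves).groupby(level=0).mean()

    report = pd.DataFrame({"train": leaf_response_rates(X_train, y_train),
                           "valid": leaf_response_rates(X_valid, y_valid)})
    print(report)   # leaves where the two columns disagree sharply suggest overfitting

The particular tool does not matter; the point is that with leaves this large, the training and validation response rates for each leaf should agree closely.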

I have focused on the number of leaves rather than the number of variables since I think that is a better measure of tree complexity. You actually asked about the number of variables. I recommend a two-stage approach. In the first stage, do not worry about how many variables there are or which variable from each family of related variables gets picked by the model. One of the great uses for a decision tree is to pick a small subset of useful variables out of hundreds or thousands of candidates. At a later stage, look at the variables that were picked and think about what concept each of them is getting at. Then pick a set of variables that express those concepts neatly and perhaps even elegantly. You might find, for example, that the customer ID is a good predictor and appears in many rules because customer IDs were assigned serially and long-time customers behave differently than recent customers. Even though this makes perfect sense, it would be hard to explain so you would replace it with a more transparent indication of customer tenure such as "months since first purchase."
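
As a sketch of that first stage (again with hypothetical names, and a real project would do more than glance at importances), a tree can be used purely to shortlist candidate variables:

    # Sketch: use a decision tree only to shortlist candidate variables.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("candidates.csv")          # hypothetical: hundreds of candidate columns
    X = df.drop(columns="target")
    y = df["target"]

    tree = DecisionTreeClassifier(min_samples_leaf=500).fit(X, y)

    importance = pd.Series(tree.feature_importances_, index=X.columns)
    print(importance[importance > 0].sort_values(ascending=False).head(20))
    # Stage two is human work: decide what concept each picked variable expresses and
    # substitute something transparent, such as months_since_first_purchase for a serial customer ID.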

Monday, September 29, 2008

Three Questions

Hi Gordon & Michael,

I have a few questions, hope you can help me!

1. While modeling, if we don’t have a very specific client requirement, at what accuracy should we usually stop? Should we stop at 75%, or 80%? Are there standard accuracy requirements based on the industry? For example, in drug research & development, model accuracy is required to be very high.

2. What is the best approach for selecting records/training dataset when the client doesn’t have info on the cut-off/valid ranges for certain numeric columns? If it’s something like Age, there is no problem. But when it’s client/business specific columns, it’s not that easy to figure out the valid ranges. What I usually do for such problems is - 1. do some research on the web to have an understanding on all the values that the specific column can take 2. see the data distribution of that column and select values based on the percentiles. E.g if values from 10 to 60 (for that column) represent 80% of all the records, I exclude all records having values outside this range. Is this a good approach? Are there other alternatives?

3. Generally, I see model accuracy (predictive/risk/churn models) getting better when I recode/transform continuous variables into categorical variables through binning/grouping. But this also results in loss of information. How do we strike a balance here? I believe the business/domain should only decide whether I should use continuous or categorical values, and not the statistics. Is that correct?

Will check your blog regularly for the answers :)

Thanks,

Romakanta Irungbam


These three questions have something in common: There is no single right answer since so much depends on the business context (in the first two cases) or the modeling context (in the third case).

First Question

My statement about no right answers is especially true of the question regarding accuracy. There are contexts where a 95% error rate is perfectly acceptable. I am thinking of response modeling for direct mail. If a model is used to choose people likely to respond to an offer and only 5% of those chosen actually respond, then the error rate is 95%. How could that be acceptable? Well, if a 4% response rate is required for profitability and the response rate for a randomly selected control group is 3% then the model--despite its apparently terrible error rate--has heroically turned a money-losing campaign into a profitable one. Success is measured in dollars (or rupees or yen, but you know what I mean) not by error rates.
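
To put hypothetical numbers on it (the dollar figures below are invented; only the 3%, 4%, and 5% rates come from the example above):

    # Hypothetical campaign economics: $1 to mail a piece, $25 of profit per response,
    # so the campaign breaks even at a 4% response rate.
    cost_per_piece, profit_per_response = 1.00, 25.00

    for name, response_rate in [("random list", 0.03), ("model's picks", 0.05)]:
        net = profit_per_response * response_rate - cost_per_piece   # net per piece mailed
        print(f"{name}: {net:+.2f} per piece")   # random list: -0.25, model's picks: +0.25

Under those assumptions, the model turns a 25-cent loss per piece into a 25-cent gain, even though 95 of every 100 names it selects never respond.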

In other contexts, much better accuracy is required. A model for credit-card fraud cannot afford a high false-positive rate because this will result in legitimate transactions not being approved. The result is unhappy card holders canceling their accounts. Even if your client cannot provide an explicit requirement for accuracy, you may be able to derive one from the business context.

Absent any other constraints, I tend to stop trying to improve a model when I reach the point of diminishing returns. When a large effort on my part yields only a minor improvement, my time will probably be better spent on some other problem.

Second Question

This question is really about when to throw out data. I see no reason to discard data just because it happens to be in the tails of the distribution. To use your example where 80% of the records have values between 10 and 60, it may be that all the best customers have a value of 75 or more. It may make sense to throw out records that contain clearly impossible values, but even in that case, I would want to understand how the impossible values were generated. If all the records with impossibly high ages were generated in the same geographic region or came through the same distribution channel, throwing them out will bias your sample.
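
If it helps to see the idea in code, here is a small sketch (hypothetical pandas code with made-up column names) of profiling out-of-range values before deciding their fate:

    import pandas as pd

    df = pd.read_csv("customers.csv")     # hypothetical columns: age, region, channel, value

    # Clearly impossible values: find out where they come from before discarding them.
    impossible = df[(df["age"] < 0) | (df["age"] > 110)]
    print(impossible.groupby(["region", "channel"]).size())   # concentrated in one source?

    # Values that are merely in the tails get examined, not excluded.
    tail = df[df["age"] > df["age"].quantile(0.98)]
    print(tail["value"].describe())       # are these actually the best customers?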

Often, unusual values have some fairly simple explanation. When looking at loyalty card data for a supermarket, we found that there were a few cards that had seemingly impossibly large numbers of orders. The explanation was that when people checked out without their card and were therefore in danger of missing out on a discount, the nice checkout lady took pity on them and used her own card to get them the discount. Understanding that mechanism meant we could safely ignore data for those cards since they did not represent the actual shopping habits of any real customer.

Third Question

Whether binning continuous variables is helpful or harmful depends very much on the particular modeling algorithm you are using and on how the binning is performed. I do not agree that, as a general rule, models are improved by binning continuous variables. As you note, this process destroys information. As an extreme example, suppose the quantity you are estimating is completely determined by a continuous (or discrete, but finely incremented) input--a tax of a constant amount per liter, say. The more accurately you can measure the number of liters sold, the more accurately you can estimate the tax revenue. In such a case, binning could only be harmful.

Binning tends to be helpful when the relationship between the explanatory variable and the thing you are trying to explain is more complex than the particular modeling technique you have chosen can handle--for example, when you have chosen a linear model and the relationship is non-linear. I once modeled household penetration for my local newspaper, the Boston Globe. One of my explanatory variables was distance from Boston. Clearly, this should have some effect, but there is only a low level of linear correlation. This is because penetration rises with distance as you travel out to the first ring of suburbs, where penetration is highest, but then falls again as you continue to travel farther from Boston. So a linear model could not make good use of the untransformed variable, but it could make use of three variables of the form within_three, three_to_ten, and beyond_ten (assuming that 3 and 10 are the right bin boundaries). Of course, binning is not the only transformation that could help, and linear models are not the only choice of model.
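
As a sketch of that kind of transformation (hypothetical data and boundary values), the binning might look like this in pandas:

    import pandas as pd

    df = pd.read_csv("households.csv")    # hypothetical columns: distance_miles, penetration

    df["distance_band"] = pd.cut(df["distance_miles"],
                                 bins=[0, 3, 10, float("inf")],
                                 labels=["within_three", "three_to_ten", "beyond_ten"])

    # Indicator variables a linear model can use in place of raw distance:
    indicators = pd.get_dummies(df["distance_band"])

    print(df.groupby("distance_band", observed=True)["penetration"].mean())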

Friday, September 5, 2008

Sorting Cells in Excel Using Formulas, Part 2

In a previous post, I described how to create a new table in Excel from an existing table where the cells in the new table are sorted by some column in the existing table. In addition, the new table is automatically updated when the values in the original table are modified.

The previously described approach, alas, has some shortcomings:
  • Only one column can be used for the sort key.
  • The column must be numeric.
  • The column cannot have any duplicate values.
This post generalizes the earlier method by fixing these problems.

If you are interested in this post, you may be interested in my book Data Analysis Using SQL and Excel.


Overview of Simpler Method

The simpler method described in the earlier post recognizes that creating a live sorted table connected to another table consists of the following steps:
  1. Ranking the rows in the table by the column to be sorted.
  2. Using the rank with the OFFSET() function to create the resulting table.
For Step (1), the method uses the built-in RANK() function provided by Excel. This introduces the limitations described above, because RANK() only works on numeric values and produces the same value for duplicates.

The key to fixing these problems is to replace the RANK() function with more general purpose functions.

Instead of RANK()

RANK() determines whether a value is the largest, second largest, third largest, and so on with respect to a list (or smallest, if we are going in the opposite order, which is determined by an optional third argument). One way to think of what it does is that it sorts the values in the list and determines the position of the original value.

An alternative but equivalent way of thinking about the calculation is that it tells us how many values are larger than (or smaller than) the given value. This alternative definition suggests other ways of arriving at the same rankings, such as:

....=COUNTIF(data!B$2:B$55, "<="&data!B2)

This formula can be placed alongside the original table (or anywhere else) and then copied down. It works by counting the number of values that are less than or equal to each value, so the resulting ranking runs from the smallest value to the largest. To reverse the order, simply use ">=" instead. This solves one of the original problems, because the COUNTIF() function works with string data as well as numeric data.

An almost equivalent formulation is to use array functions.

....{=SUM(IF(data!B$2:B$55<=data!B2, 1, 0))}

(If you are not familiar with array functions, check out Excel documentation or Data Analysis Using SQL and Excel.)

This is very similar to the COUNTIF() method, although the array functions have one advantage: the conditional logic can be more complicated, so we can rank by multiple columns at the same time.

Using our own version of the rank function fixes two of the three problems. At this point, duplicates still get the same rank value.


Handling Duplicates

The problem with duplicate values is that all these methods assign the same ranking when two different rows have the same value. This makes it impossible to distinguish between the two rows, so one of them will appear in the sorted table multiple times (and the other not at all).

The solution is to adjust the ranking by an offset. If the lowest value is repeated multiple times, then all of those rows will have a ranking equal to the number of duplicates. In the following little table, the second column contains the rankings as calculated by either of the above two methods (RANK() does not work because the first column is not numeric):


a 3
a 3
a 3
b 5
b 5
What we want, though, is to have distinct values in the second column:


a 1
a 2
a 3
b 4
b 5

The solution is to subtract a value from the calculated ranking. This is the number of values that we have already seen that are equal to the value in question. Once again, this can be accomplished with either COUNTIF() or array functions:

....=COUNTIF(data!B$2:B$55, "<="&data!B2) - COUNTIF(data!B$2:B2, "="&data!B2) + 1

or

....{=SUM(IF(data!B$2:B$55<=data!B2, 1, 0)) - SUM(IF(data!B$2:B2=data!B2, 1, 0)) + 1}

These formulations consist of two parts. The first part calculates a ranking, where duplicates get the same value. The second part subtracts the number of duplicates already seen in the list. For the simple example above, the results are actually:


a 3
a 2
a 1
b 5
b 4
This works just as well, although it does not preserve the original ordering among the duplicate values.

Note that these formulas are all structured so they can be copied down cells and continue working.
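
For anyone who finds it easier to check the arithmetic outside of Excel, here is the same duplicate-handling logic as a small Python (pandas) sketch -- purely an illustration, not part of the spreadsheet:

    import pandas as pd

    values = pd.Series(["a", "a", "a", "b", "b"])

    base = values.apply(lambda x: (values <= x).sum())   # a -> 3, b -> 5: ties share a rank
    seen = values.groupby(values).cumcount()             # 0, 1, 2 for the a's; 0, 1 for the b's
    print((base - seen).tolist())                        # [3, 2, 1, 5, 4] -- all distinct

    # pandas can also produce distinct ranks directly, breaking ties in original order:
    print(values.rank(method="first").astype(int).tolist())   # [1, 2, 3, 4, 5]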



What It All Looks Like Together

This method is perhaps best explained by seeing an example. The file sort-in-place.xls contains assorted information about the fifty states (latitude, longitude, population, and capital, for example) on the "data" worksheet. The "data-sorted" worksheet shows the state abbreviations in rank order for each of the columns. For instance, for the size column, Alaska is first, followed by Texas, California, and Montana. For the population column, the ordering is California, Texas, New York, and Florida. This worksheet uses the rankings on the "ranking-countif()" worksheet.

The three worksheets called "ranking-" illustrate the three different methods of doing the rankings -- using RANK(), using COUNTIF(), and using array functions. Note that the RANK() method cannot handle text columns, so it does not work in this case.

If you like, you can change the data on the "data" tab and see the rankings change on the sorted tab. Voila! A sorted table connected by formulas to the original table!


Tuesday, August 26, 2008

MapReduce Functionality in Commercial Databases

If you use LinkedIn, then you have probably been impressed by their "People you may know" feature. I know that I have. From old friends and colleagues to an occasional person I don't necessarily want to see again, the list often contains quite familiar names.

LinkedIn is basically a large graph of connections among people, enhanced with information such as company names, date of link, and so on. We can imagine how they determine whether someone might belong in another member's "People you may know" list: by using common names, common companies, and even common paths (people who know each other).

However, trying to imagine how they might determine this information using SQL is more challenging. SQL provides the ability to store a graph of connections. However, traversing the graph is rather complicated in standard SQL. Furthermore, much of the information that LinkedIn maintains is complicated data -- names of companies, job titles, and dates, for instance.
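
To make that contrast concrete, here is a toy sketch in Python -- nothing to do with Greenplum's actual programming interface -- of how counting mutual connections, one plausible ingredient of "People you may know", decomposes into a map step and a reduce step:

    # Toy sketch: count mutual connections with map and reduce steps.
    from collections import defaultdict
    from itertools import combinations

    # Each record: (person, [people they are connected to])
    records = [
        ("ann",  ["bob", "carl", "dina"]),
        ("bob",  ["ann", "carl"]),
        ("carl", ["ann", "bob", "dina"]),
    ]

    def map_step(person, friends):
        # Emit each pair of this person's friends; they share at least this one connection.
        for a, b in combinations(sorted(friends), 2):
            yield (a, b), 1

    def reduce_step(pair, counts):
        return pair, sum(counts)

    # A trivial single-machine "shuffle" that groups the map output by key:
    grouped = defaultdict(list)
    for person, friends in records:
        for key, value in map_step(person, friends):
            grouped[key].append(value)

    suggestions = [reduce_step(pair, counts) for pair, counts in grouped.items()]
    print(sorted(suggestions, key=lambda kv: -kv[1]))   # pairs with many mutual connections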

It is not surprising, then, that they are using MapReduce to develop this information. The surprise, though, is that their data is being stored in a relational database, which provides full transactional integrity and SQL querying capabilities. The commercial database software that supports both is provided by a company called Greenplum.

Greenplum has distinguished itself from other next-generation database vendors by incorporating MapReduce into its database engine. Basically, Greenplum developed a parallel framework for managing data, ported Postgres into this framework, and has now ported MapReduce as well. This is a strong distinction from other database vendors that provide parallel Postgres solutions, and it is particularly well suited to the complex datatypes encountered on the web.

I do want to point out that the integration of MapReduce is at the programming level. In other words, they have not changed SQL; they have added a programming layer that allows data in the database to be readily accessed using MapReduce primitives.

As I've discussed in other posts, MapReduce and SQL are complementary technologies, each with its own strengths. MapReduce can definitely benefit from SQL functionality, since SQL has proven its ability for data storage and access. On the other hand, MapReduce has functionality that is not present in SQL databases.

Now that a database vendor has fully incorporated MapReduce into its database engine, we need to ask: Should MapReduce remain a programming paradigm or should it be incorporated into the SQL query language? What additional keywords and operators and so on are needed to enhance SQL functionality to include MapReduce?

Monday, August 11, 2008

Sorting Cells in Excel Using Formulas

This post describes how to create a new table in Excel from an existing table, where the cells are sorted, and to do this using only formulas. As a result, modifying a value in the original table results in rearranging the sorted table. And, this is accomplished without macros and without using the "sort" menu option.

The material in this post is generalized in another post. Also, if you are interested in this post, you may be interested in my book Data Analysis Using SQL and Excel.

Consider a table with two columns:



12   B
14   D
13   C
11   A


The sorted table would be automatically calculated as:



11   A
12   B
13   C
14   D


If the first value were changed to 15, then the sorted table would automatically recalculate as:



11   A
13   C
14   D
15   B


There are two typical approaches to sorting cells in Excel. The first is to select a region and to sort it using menu options. This does not work when the cells are protected or are part of a pivot table, and sometimes not when they are calculated. It might also be a bad idea when the data is copied from another location or loaded from a database.

A common alternative is to resort to writing a macro. However, Visual Basic macros are beyond the capabilities of even many experienced Excel users.

The approach described here is much simpler, since it only uses formulas. I should mention that the method described in this post only works for numeric data that has no duplicates. I will remedy that in the next post, where the ideas are extended both to data with duplicates and to character data.

Three Excel functions are the key to the idea: RANK(), MATCH(), and OFFSET(). The first ranks the numbers in a list. The other two use that ranking to look up the rows in sorted order.

The following shows the effect of the RANK() function:



Value   Label   Rank
   15   B       4
   14   D       3
   13   C       2
   11   A       1


The function itself looks like:



Value   Label   Rank
   15   B       =RANK(C5, $C$5:$C$8, 1)
   14   D       =RANK(C6, $C$5:$C$8, 1)
   13   C       =RANK(C7, $C$5:$C$8, 1)
   11   A       =RANK(C8, $C$5:$C$8, 1)




The first argument is the value to be ranked. The cell C5 contains the value "15". The second argument is the range to use for the ranking -- all the values in the list. And the third is the sort order; a value of 1 means that the lowest values get the lowest rankings. In this case, "15" is the largest of the four values, so its rank is 4.

The following shows the formulas for the table:



Rank   Value                                        Label
1      =OFFSET(C$4, MATCH(C11, $E$5:$E$8, 0), 0)    =OFFSET(D$4, MATCH(C11, $E$5:$E$8, 0), 0)
2      =OFFSET(C$4, MATCH(C12, $E$5:$E$8, 0), 0)    =OFFSET(D$4, MATCH(C12, $E$5:$E$8, 0), 0)
3      =OFFSET(C$4, MATCH(C13, $E$5:$E$8, 0), 0)    =OFFSET(D$4, MATCH(C13, $E$5:$E$8, 0), 0)
4      =OFFSET(C$4, MATCH(C14, $E$5:$E$8, 0), 0)    =OFFSET(D$4, MATCH(C14, $E$5:$E$8, 0), 0)


The first column is the desired ranking column; it simply contains the numbers starting at 1 and incrementing by 1. The function MATCH(C11, $E$5:$E$8, 0) looks up each desired rank in the column of calculated ranks. The value in C11 is "1", and rank 1 appears in the fourth position of the rank column. The OFFSET() function then fetches the fourth value in the C column for the value and the fourth value in the D column for the label.

The result is that the sorted table is tied to the original table by formulas, so changing values in the original table will result in recalculating the sorted table, automatically.

The overall approach is simple to describe. First, we calculate the ranking of each row in the original table based on the column that we want to sort by. This ranking takes on the values 1 to N for the values in the table. Then, we create a new sorted table that has the rankings in order. This table looks up the appropriate row for each ranking using the MATCH() function. Finally, the OFFSET() function is used to look up the appropriate values from that row. The result is a table that is sorted, with a "live" connection to another table.
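
For readers who prefer to see the logic outside of Excel, the rank-then-look-up idea boils down to a few lines of Python (just an illustration of the logic, not part of the spreadsheet):

    values = [15, 14, 13, 11]
    labels = ["B", "D", "C", "A"]

    # The RANK() step: the position each value would have in sorted order.
    ranks = [sorted(values).index(v) + 1 for v in values]        # [4, 3, 2, 1]

    # The MATCH()/OFFSET() step: for each desired rank, find the row that holds it.
    sorted_rows = [(values[ranks.index(r)], labels[ranks.index(r)])
                   for r in range(1, len(values) + 1)]
    print(sorted_rows)   # [(11, 'A'), (13, 'C'), (14, 'D'), (15, 'B')]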

Tuesday, August 5, 2008

B2B Data Mining

Last week, I travelled to Malaysia to apply data mining techniques to business-to-business (B2B) problems. More typically, the work we do at Data Miners is in the world of business-to-consumer (B2C), since a company with millions of customers is more likely to need our services. However, a certain gentleman read my book Data Analysis Using SQL and Excel and wanted to apply the ideas to his business.

In this case, the business is a large computer hardware company that sells complex, customized systems to its customers. In fact, the systems are so complex that there is a special organization whose job is to configure them. The joke is that this allows the sales force to spend more time on the golf course. More seriously, it reduces configuration mistakes and helps ensure that the end customers receive the right system.

The data that we worked with was primarily sales data. The data is not tiny; several tables contain millions of rows, and the whole database probably occupies about ten gigabytes of data. However, the data structure is more complex than in a typical B2C database.

First, the important item of data is a quote. This is a configuration of hardware, software, networking equipment, and services provided to the customer. Some quotes result in orders; many do not. Each customer typically has many quotes, and several employees work on each quote. However, typically only one of those employees (let's call that person the sales rep) is credited with the sale.

A quote contains lists of items. However, these items are at the most detailed level (let's call it the UPC -- universal product code -- level). The hierarchy of products is important, particularly the two highest levels.

The highest level is the product category -- hardware, software, and so on. The next highest level is the catalog-listing level, that is, the product name in the catalog. For instance, I am writing this on an old Dell Precision M60 -- that is the "catalog" name. However, the specific disk size, amount of memory, peripherals, and software might differ from computer to computer. Does it have a 40 Gig disk or an 80 Gig disk? And so on.

What type of questions might be answered by a history of customer quotes, with associated dimension information?

One question is the "quote-to-close" question. This is a simple time-to-event problem of how long from when a quote is created to when the quote results in an order. Generally, companies in this business look at the close rate for quotes, but not the timing of the close. However, the timing provides important information, particularly when it is changing over time.
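
A hypothetical sketch of that calculation, with made-up table and column names, might look like this:

    import pandas as pd

    quotes = pd.read_csv("quotes.csv", parse_dates=["quote_date", "order_date"])
    # order_date is assumed to be blank for quotes that never became orders.

    closed = quotes.dropna(subset=["order_date"]).copy()
    closed["days_to_close"] = (closed["order_date"] - closed["quote_date"]).dt.days

    print(closed["days_to_close"].describe())                     # the quote-to-close distribution
    print(closed.groupby(closed["quote_date"].dt.to_period("M"))  # is the timing drifting by month?
                ["days_to_close"].median())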

A related question is what affects the quote-to-close time? Complex quotes with many different components might lengthen the time. Similarly, large discounts that require more levels of approval might lengthen the time. In this case, we were surprised that certain small quotes had a very long time. These small quotes had large percentage discounts, and were often paired with larger orders. The large percentage discount required more levels of approval -- a sign that the company is spending too much effort in the wrong place.

Such data also suggests market basket analysis. "What products are purchased together?" is not quite the right question. Instead, the question is "What configurations are most common?"

Common configurations lead to additional questions. Which configurations (if any) are associated with longer quote-to-close times? Which configurations are associated with longer install times? Some products are readily available, some have a longer lead time, and some are built-to-order. The logistics database is not necessarily available for quotes. And, the historical logistics database (what the lead time for products was at a given point in the past) probably does not even exist.

Another genre of question is related to purchase velocity. After a customer makes a purchase, when does the next purchase take place? In this business, purchases can be one of several types: a new purchase represents new equipment; an upgrade adds components to existing equipment; and a replacement swaps one model for another. The pattern of purchases is interesting, both at the customer level and the sales rep level. Some sales reps may be focusing on upgrade business (which is relatively easy, since the customer already has the product), and not expanding into new customers.

A related idea to purchase velocity is to categorize customers by the time since their most recent purchase. What volume of purchases is made by customers who purchased something in the previous month? In the previous twelve months? By year? Most customers make at least one purchase per year. For the dormant customers who do return, what caused them to come back?
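
Here is a hypothetical sketch of that kind of recency breakdown (made-up column names; treating a month as 30.4 days is just an approximation):

    import pandas as pd

    orders = pd.read_csv("orders.csv", parse_dates=["order_date"])   # customer_id, order_date, amount
    as_of = orders["order_date"].max()

    by_customer = orders.groupby("customer_id").agg(last_purchase=("order_date", "max"),
                                                    total_amount=("amount", "sum"))
    months_since = (as_of - by_customer["last_purchase"]).dt.days / 30.4

    by_customer["recency_band"] = pd.cut(months_since,
                                         bins=[-1, 1, 12, 24, float("inf")],
                                         labels=["within 1 month", "1 to 12 months",
                                                 "12 to 24 months", "over 24 months"])

    print(by_customer.groupby("recency_band", observed=True)["total_amount"].sum())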

B2B data is more complicated than B2C data, because purchases are more complex and customers are organizations with multiple sites and contacts. Often, people are paid to understand the customers and to meet their needs; however, sales reps understand individual customers. There is ample opportunity for data mining to learn patterns about customers in general and to learn what works and what does not work in particular circumstances.

Tuesday, July 29, 2008

Nested Subqueries in SQL

A recent question:

You used a lot of multi-layer subqueries. Equivalently, I think we can create intermediate tables or views and query them. It's easier for me to build, to follow, and to debug, especially from a dataflow diagram. But I do believe the two approaches will result in different time and space required. Could you elaborate on the difference?

[Note: This question is about the book Data Analysis Using SQL and Excel.]
This is a good question. Ironically, I received the email while I was sitting in an office in Kuala Lumpur, Malaysia, writing very complicated nested queries against a very remote database (the database is actually in San Diego).

In this case, the need for complicated nested subqueries was obvious: I (and the group I was working with) only have read access into the database. And for good reason. Although the database contains key analytic information, it is an operational database. In such cases, analysts often have only read access. (By the way, this brings up a question for Oracle: Why can't a user with read-only access explain a query?)

This very immediate experience provides the first answer to the question. In some circumstances, it is not possible or desirable to write to the database.

However, that is only the first reason. There are other reasons.

One of the jobs of a database engine is to find the most efficient execution plan. For instance, a query might join several tables together. If we do these joins with intermediate tables, say, two at a time, then we impose an ordering on them. However, one of the most important parts of a SQL optimizer is the part that chooses the ordering of joins. Some ways of doing the joins are more efficient than others.

So, a second reason is that explicitly storing results in intermediate tables might prevent the SQL optimizer from choosing the most efficient query plan.

The third reason also has to do with the database engine. Database engines manage storage. A complex query plan may produce several intermediate results in temporary tables. The writer of the query does not need to name these tables, keep track of the storage, or remember to delete them. The query engine does this automatically.

Doing analysis in a single query (versus in a script file) saves time and effort in storage management.

Storing data in intermediate tables may also impose constraints on those tables. In particular, the intermediate tables may not be parallelized, or the table space used for storage may be inefficient in some other way. Using subqueries eliminates any dependency on possibly inefficient temporary user storage.

Another reason has to do with columns in intermediate results that are not used. Eliminating unneeded columns during query processing can significantly improve efficiency. When results are stored in an intermediate table, all the columns are stored. When using subqueries, the query plan should include only the columns actually needed.

Subqueries allow the query plan to eliminate unused columns from intermediate tables.

A final reason for me is really a personal preference. The task of maintaining separate queries and tables, especially keeping the naming conventions straight, is cumbersome. If I store results in a temporary table, I want the name to mean something. If I'm using a subquery, then the name is less important. Or, if I change the name or type of a column, then I find it easier to make the change in one place rather than have it distributed through a script file. Using a subquery reduces my need to think of new names.

In general, writing complicated subqueries actually allows the query optimizer to do what it does best -- determine the most efficient execution plan for running the query. Although there are some exceptions, the general rule is quite simple: try to focus on the questions being answered and not how the results are combined. SQL may not be a perfect language for processing, but it does allow us to do very complex data manipulations with a minimum of programming concerns.