If you use LinkedIn, then you have probably been impressed by their "People you may know" feature. I know that I have. From old friends and colleagues to an occasional person I don't necessarily want to see again, the list often contains quite familiar names.
LinkedIn is basically a large graph of connections among people, enhanced with information such as company names, dates of links, and so on. We can imagine how they determine whether someone might be in someone's "People you may know" category: by using common names, common companies, and even common paths (people who know each other).
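To make the idea concrete, here is a minimal Python sketch -- certainly not LinkedIn's actual algorithm -- that counts shared connections between pairs of people who are not yet linked; the names and edges are invented for illustration:

from collections import defaultdict
from itertools import combinations

# A toy connection graph: each pair of names is a link.
edges = [("ann", "bob"), ("ann", "carl"), ("bob", "dave"), ("carl", "dave"), ("dave", "eve")]

connections = defaultdict(set)
for a, b in edges:
    connections[a].add(b)
    connections[b].add(a)

# For every person, each pair of their connections who are not already
# linked to each other gains one shared connection.
shared = defaultdict(int)
for person, friends in connections.items():
    for f1, f2 in combinations(sorted(friends), 2):
        if f2 not in connections[f1]:
            shared[(f1, f2)] += 1

# Pairs with the most shared connections are the "people you may know" candidates.
for pair, count in sorted(shared.items(), key=lambda kv: -kv[1]):
    print(pair, count)

A real system would presumably also weight common companies and the other shared attributes mentioned above.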
However, trying to imagine how they might determine this information using SQL is more challenging. SQL provides the ability to store a graph of connections. However, traversing the graph is rather complicated in standard SQL. Furthermore, much of the information that LinkedIn maintains is complicated data -- names of companies, job titles, and dates, for instance.
It is not surprising, then, that they are using MapReduce to develop this information. The surprise, though, is that their data is being stored in a relational database, which provides full transactional integrity and SQL querying capabilities. The commercial database software that supports both is provided by a company called Greenplum.
Greenplum has distinguished itself from other next-generation database vendors by incorporating MapReduce into its database engine. Basically, Greenplum developed a parallel framework for managing data, ported Postgres into this framework, and has now ported MapReduce as well. This is a strong distinction from other database vendors that provide parallel Postgres solutions, and it is particularly well suited to the complex data types encountered on the web.
I do want to point out that the integration of MapReduce is at the programming level. In other words, they have not changed SQL; they have added a programming layer that allows data in the database to be readily accessed using MapReduce primitives.
As I've discussed in other posts, MapReduce and SQL are complementary technologies, each with their own strengths. MapReduce can definitely benefit from SQL functionality, since SQL has proven its ability for data storage and access. On the other hand, MapReduce has functionality that is not present in SQL databases.
Now that a database vendor has fully incorporated MapReduce into its database engine, we need to ask: Should MapReduce remain a programming paradigm or should it be incorporated into the SQL query language? What additional keywords and operators and so on are needed to enhance SQL functionality to include MapReduce?
Monday, August 11, 2008
Sorting Cells in Excel Using Formulas
This post describes how to create a new table in Excel from an existing table, where the cells are sorted, and to do this using only formulas. As a result, modifying a value in the original table results in rearranging the sorted table. And, this is accomplished without macros and without using the "sort" menu option.
The material in this post is generalized in another post. Also, if you are interested in this post, you may be interested in my book Data Analysis Using SQL and Excel.
Consider a table with two columns:
Value | Label
12 | B
14 | D
13 | C
11 | A
The sorted table would be automatically calculated as:
Value | Label
11 | A
12 | B
13 | C
14 | D
If the first value were changed to 15, then the sorted table would automatically recalculate as:
Value | Label
11 | A
13 | C
14 | D
15 | B
There are two typical approaches to sorting cells in Excel. The first is to select a region and to sort it using menu options. This does not work when the cells are protected or part of a pivot table, and sometimes not when they are calculated. It might also be a bad idea when the data is copied from another location or loaded from a database.
A common alternative is to resort to writing a macro. However, Visual Basic macros are beyond the capabilities of even many experienced Excel users.
The approach described here is much simpler, since it only uses formulas. I should mention that the method described in this post only works for numeric data that has no duplicates. I will remedy that in the next post, where the ideas are extended both to data with duplicates and to character data.
Three Excel functions are the key to the idea: RANK(), MATCH(), and OFFSET(). The first ranks the numbers in a list. The second finds the position of a given rank in that list, and the third retrieves the values from the matching row.
The following shows the effect of the RANK() function:
Value | Label | Rank
15 | B | 4
14 | D | 3
13 | C | 2
11 | A | 1
The function itself looks like:
Value | Label | Rank
15 | B | =RANK(C5, $C$5:$C$8, 1)
14 | D | =RANK(C6, $C$5:$C$8, 1)
13 | C | =RANK(C7, $C$5:$C$8, 1)
11 | A | =RANK(C8, $C$5:$C$8, 1)
The first argument is the value to be ranked; cell C5 contains the value 15. The second argument is the range to use for the ranking -- all the values in the column. The third is the direction; a 1 means the lowest values get the lowest ranks. In this case, 15 is the largest of the four values, so its rank is 4.
The following shows the formulas for the table:
Rank | Value | Label
1 | =OFFSET(C$4, MATCH(C11, $E$5:$E$8, 0), 0) | =OFFSET(D$4, MATCH(C11, $E$5:$E$8, 0), 0)
2 | =OFFSET(C$4, MATCH(C12, $E$5:$E$8, 0), 0) | =OFFSET(D$4, MATCH(C12, $E$5:$E$8, 0), 0)
3 | =OFFSET(C$4, MATCH(C13, $E$5:$E$8, 0), 0) | =OFFSET(D$4, MATCH(C13, $E$5:$E$8, 0), 0)
4 | =OFFSET(C$4, MATCH(C14, $E$5:$E$8, 0), 0) | =OFFSET(D$4, MATCH(C14, $E$5:$E$8, 0), 0)
The first column is the desired ranking column -- simply the numbers starting at 1 and incrementing by 1. The function MATCH(C11, $E$5:$E$8, 0) looks up each desired rank in the column of calculated ranks. The value in C11 is 1, and rank 1 appears in the fourth position of the rank column. The OFFSET() function then retrieves the fourth value in column C for the value and the fourth value in column D for the label.
The result is that the sorted table is tied to the original table by formulas, so changing values in the original table will result in recalculating the sorted table, automatically.
The overall approach is simple to describe. First, we calculate the ranking of each row in the original table based on the column that we want to sort; this ranking takes on the values 1 to N for the rows in the table. Then, we create a new sorted table that has the rankings in order. This table looks up the appropriate row for each ranking using the MATCH() function. Finally, the OFFSET() function is used to look up the appropriate values from that row. The result is a table that is sorted, with a "live" connection to the original table.
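For readers who prefer code to formulas, here is a minimal Python sketch of the same rank/match/offset logic; the data matches the example above, and the function names are mine, not Excel's. Like the spreadsheet version, it assumes the values have no duplicates:

def rank(value, values):
    # Equivalent of RANK(value, values, 1): 1 is the smallest value.
    return sorted(values).index(value) + 1

def live_sort(table):
    # Rebuild the table in rank order, mimicking the MATCH()/OFFSET() formulas.
    values = [row[0] for row in table]
    ranks = [rank(v, values) for v in values]      # the helper Rank column
    result = []
    for wanted in range(1, len(table) + 1):        # desired ranks 1..N
        position = ranks.index(wanted)             # MATCH(wanted, ranks, 0)
        result.append(table[position])             # OFFSET() pulls value and label
    return result

original = [(15, "B"), (14, "D"), (13, "C"), (11, "A")]
print(live_sort(original))    # [(11, 'A'), (13, 'C'), (14, 'D'), (15, 'B')]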
Tuesday, August 5, 2008
B2B Data Mining
Last week, I travelled to Malaysia to apply data mining techniques to business-to-business (B2B) problems. More typically, the work we do at Data Miners is in the world of business-to-consumer (B2C), since a company with millions of customers is more likely to need our services. However, a certain gentleman read my book Data Analysis Using SQL and Excel and wanted to apply the ideas to his business.
In this case, the business is a large computer hardware company that sells complex, customized systems to its customers. In fact, the systems are so complex that there is a special organization just to configure them. The joke is that this allows the sales force to spend more time on the golf course. More seriously, it reduces configuration mistakes and helps ensure that the end customers receive the right system.
The data that we worked with was primarily sales data. The data is not tiny; several tables contain millions of rows, and the whole database probably occupies about ten gigabytes of data. However, the data structure is more complex than in a typical B2C database.
First, the important item of data is a quote. This is a configuration of hardware, software, networking equipment, and services provided to the customer. Some quotes result in orders; many do not. Each customer typically has many quotes, and several employees work on each quote. However, typically only one of those employees (let's call that person the sales rep) is credited with the sale.
A quote contains lists of items. However, these items are at the most detailed level (let's call it the UPC -- universal product code -- level). The hierarchy of products is important, particularly the two highest levels.
The highest level is the product category -- hardware, software, and so on. The next highest level is the catalog listing -- that is, the product name in the catalog. For instance, I am writing this on an old Dell Precision M60; that is the "catalog" name. However, the specific disk size, amount of memory, peripherals, and software might differ from computer to computer. Does it have a 40 Gig disk or an 80 Gig disk? And so on.
What types of questions might be answered by a history of customer quotes, with associated dimension information?
One question is the "quote-to-close" question. This is a simple time-to-event problem: how long does it take from when a quote is created until it results in an order? Generally, companies in this business look at the close rate for quotes, but not the timing of the close. However, the timing provides important information, particularly when it is changing over time.
A related question is: what affects the quote-to-close time? Complex quotes with many different components might lengthen the time. Similarly, large discounts that require more levels of approval might lengthen the time. In this case, we were surprised that certain small quotes took a very long time to close. These small quotes had large percentage discounts and were often paired with larger orders. The large percentage discount required more levels of approval -- a sign that the company is spending too much effort in the wrong place.
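As a rough sketch of how such a quote-to-close measure might be computed, assuming hypothetical columns such as quote_date, order_date, and discount_pct (these names are mine, not the client's schema):

import pandas as pd

# Invented quote history for illustration.
quotes = pd.DataFrame({
    "quote_id":     [1, 2, 3, 4],
    "quote_date":   pd.to_datetime(["2008-01-05", "2008-01-10", "2008-02-01", "2008-02-15"]),
    "order_date":   pd.to_datetime(["2008-01-20", None, "2008-03-15", "2008-02-20"]),
    "discount_pct": [5, 30, 25, 5],
})

# Quote-to-close time in days; quotes that never closed remain missing.
quotes["days_to_close"] = (quotes["order_date"] - quotes["quote_date"]).dt.days

# Compare how many quotes closed, and how quickly, for small versus large discounts.
quotes["big_discount"] = quotes["discount_pct"] >= 20
print(quotes.groupby("big_discount")["days_to_close"].agg(["count", "mean"]))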
Such data also suggests market basket analysis. "What products are purchased together?" is not quite the right question. Instead, the question is "What configurations are most common?"
Common configurations lead to additional questions. Which configurations (if any) are associated with longer quote-to-close times? Which configurations are associated with longer install times? Some products are readily available, some have a longer lead time, and some are built to order. The logistics database is not necessarily available for quotes. And the historical logistics database (what the lead time for a product was at a given point in the past) probably does not even exist.
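A sketch of the "most common configurations" count, reducing each quote to its set of catalog-level items (the product names here are invented):

from collections import Counter

# Each quote reduced to its catalog-level items.
quote_items = {
    101: {"Precision M60", "Office Suite", "Install Service"},
    102: {"Precision M60", "Office Suite"},
    103: {"PowerEdge 2950", "Backup Software", "Install Service"},
    104: {"Precision M60", "Office Suite", "Install Service"},
}

# Treat each quote's item set as a single configuration and count identical ones.
config_counts = Counter(frozenset(items) for items in quote_items.values())
for config, n in config_counts.most_common():
    print(n, sorted(config))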
Another genre of question is related to purchase velocity. After a customer makes a purchase, when does the next purchase take place? In this business, purchases can be one of several types: a new purchase represents new equipment; an upgrade adds components to existing equipment; and a replacement swaps one model for another. The pattern of purchases is interesting, both at the customer level and at the sales rep level. Some sales reps may be focusing on upgrade business (which is relatively easy, since the customer already has the product) and not expanding into new customers.
A related idea is to categorize customers by the time since their most recent purchase. What volume of purchases is made by customers who purchased something in the previous month? In the previous twelve months? By year? Most customers make at least one purchase per year. And for the dormant customers who return, what caused them to come back?
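A sketch of the recency idea, again with invented data and column names:

import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date":  pd.to_datetime(["2008-06-01", "2008-07-15", "2007-05-01",
                                   "2008-07-01", "2008-07-20", "2008-03-01"]),
    "amount":      [100, 250, 400, 80, 120, 60],
})

as_of = pd.Timestamp("2008-08-01")
months_since = (as_of - orders.groupby("customer_id")["order_date"].max()).dt.days / 30.4

# Bucket customers by time since their most recent purchase,
# then total the purchase volume in each bucket.
recency = pd.cut(months_since, bins=[0, 1, 12, 1200],
                 labels=["last month", "last 12 months", "older"])
volume = orders.groupby("customer_id")["amount"].sum()
print(volume.groupby(recency).sum())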
B2B data is more complicated than B2C data, because purchases are more complex and customers are organizations with multiple sites and contacts. Often, people are paid to understand the customers and to meet their needs. However, sales reps understand individual customers. There is ample opportunity for data mining to learn patterns about customers in general and to learn what works and does not work in particular circumstances.
Tuesday, July 29, 2008
Nested Subqueries in SQL
A recent question:
You used a lot of multi-layer subqueries. Equivalently, I think we can create intermediate tables or views and query them. It's easier for me to build, follow, and debug, especially from a dataflow diagram. But I do believe the two approaches will result in different time and space requirements. Could you elaborate on the difference?
[Note: This question is about the book Data Analysis Using SQL and Excel.]
This is a good question. Ironically, I received the email while I was sitting in an office in Kuala Lumpur, Malaysia, writing very complicated nested queries against a very remote database (the database is actually in San Diego).
In this case, the need for complicated nested subqueries was obvious: I (and the group I was working with) only had read access to the database. And for good reason. Although the database contains key analytic information, it is an operational database. In such cases, analysts often have only read access. (By the way, this brings up a question for Oracle: why can't a user with read-only access explain a query?)
This very immediate experience provides the first answer to the question. In some circumstances, it is not possible or desirable to write to the database.
However, that is only the first reason. There are other reasons.
One of the jobs of a database engine is choosing the most efficient execution plan. For instance, a query might join several tables together. If we do these joins through intermediate tables, say, two at a time, then we impose an ordering on them. However, one of the most important parts of a SQL optimizer is the part that chooses the order of joins, and some join orders are much more efficient than others.
So, a second reason is that explicitly storing results in intermediate tables might prevent the SQL optimizer from choosing the most efficient query plan.
The third reason also has to do with the database engine. Database engines manage storage. A complex query plan may produce several intermediate results in temporary tables. The writer of the query does not need to name these tables, keep track of the storage, or remember to delete them. The query engine does this automatically.
Doing analysis in a single query (versus in a script file) saves time and effort in storage management.
Storing data in intermediate tables may also impose constraints on those tables. In particular, the intermediate tables may not be parallel, or the table space used for storage may be inefficient in some other way. Using subqueries eliminates any dependency on possibly inefficient temporary user storage.
Another reason has to do with columns in intermediate results that are not used. Eliminating unneeded columns during query processing can significantly improve efficiency. When results are stored in an intermediate table, all the columns are stored. When using subqueries, the query plan should include only the columns that are actually needed.
Subqueries allow the query plan to eliminate unused columns from intermediate tables.
A final reason is really a personal preference. The task of maintaining separate queries and tables, along with naming conventions for them, is cumbersome. If I store results in a temporary table, I want the name to mean something. If I'm using a subquery, the name is less important. And if I change the name or type of a column, I find it easier to make the change in one place rather than have it scattered through a script file. Using a subquery reduces my need to think of new names.
In general, writing complicated subqueries actually allows the query optimizer to do what it does best -- determine the most efficient execution plan for running the query. Although there are some exceptions, the general rule is quite simple: try to focus on the questions being answered and not how the results are combined. SQL may not be a perfect language for processing, but it does allow us to do very complex data manipulations with a minimum of programming concerns.
Thursday, June 5, 2008
Qualifications for Studying Data Mining
A recent question . . .
I am hoping to begin my masters degree in Data Mining. I have come from a Software Development primary degree. I am a bit worried over the math involved in Data Mining. Could you tell me, do I need to have a strong mathematical aptitude to produce a good Thesis on Data Mining?
First, I think a software development background is a good foundation for data mining. Data mining is as much about data (and hence computers and databases) as it is about analysis (and hence statistics, probability, and math).
Michael and I are not academics so we cannot speak to the thesis requirements for a particular data mining program. Both of us majored in mathematics (many years ago) and then worked as software engineers. We do have some knowledge of both fields, and the combination provided a good foundation for our data mining work.
To be successful in data mining, you do need some familiarity with math, particularly applied math -- things like practical applications of probability, algebra, the ability to solve word problems, and the ability to use spreadsheets. Unlike theoretical statistics, the purpose of data mining is not to generate rigorous proofs of various theorems; the purpose is to find useful patterns in data, to validate hypotheses, to set up marketing tests. We need to know when patterns are unexpected, and when patterns are expected.
This is a good place to add a plug for my book Data Analysis Using SQL and Excel, which has two or three chapters devoted to practical statistics in the context of data analysis.
In short, if you are math-phobic, then you might want to reconsider data mining. If your challenges in math are solving complex integrals, then you don't have much to worry about.
--gordon
Saturday, May 17, 2008
The Agent Problem: Sampling From A Finite Population
A drawer is filled with socks and you remove eight of them randomly. Four are black and four are white. How confident are you in estimating the proportion of white and black socks in the drawer?
The standard statistical approach is to assume that the number of socks in the drawer is infinite, and to use the formula for the standard error of a proportion: SQRT([proportion] * (1 - [proportion]) / [number taken out]), or more simply, SQRT(p*q/n). In this case, the standard error is SQRT(0.5*0.5/8) = 17.7%.
However, this approach clearly does not work in all cases. For instance, if there are exactly eight socks in the drawer, then the sample consists of all of them. We are 100% sure that the proportion is exactly 50%.
If there are ten socks in the drawer, then the proportion of black socks ranges from 4/10 to 6/10. These extremes are within one standard error of the observed average. Or to phrase it differently, any reasonable confidence interval (80%, 90%, 95%) contains all possible values. The confidence interval is wider than what is possible.
What does this have to do with business problems? I encountered essentially the same situation when looking at the longitudinal behavior of patients visiting physicians. I had a sample of patients who had visited the physicians and was measuring the use of a particular therapy for a particular diagnosis. Overall, about 20-30% of all patients were in the longitudinal data. And, I had pretty good estimates of the number of diagnoses for each physician.
There are several reasons why this is important. For the company that provides the therapy, knowing which physicians are using it is important. In addition, if the company does any marketing efforts, they would like to see how those efforts perform. So, the critical question is: how well does the observed patient data characterize the physicians' behavior?
This is very similar to the question posed earlier. If the patient data contains eight new diagnoses and four start on the therapy of interest, how confident am I that the doctor is starting 50% of new patients on the therapy?
If there are eight patients in total, then I am 100% confident, since all of them managed to be in my sample. On the other hand, if the physician has 200 patients, then the statistical measures of standard error are more appropriate.
The situation is exacerbated by another problem. Although the longitudinal data contains 20%-30% of all patients, the distribution over the physicians is much wider. Some physicians have 10% of their patients in the data and some have 50% or more.
The solution is actually quite simple, but not normally taught in early statistics or business statistics courses. There is something called the finite population correction for exactly this situation.
[stderr-finite] = [stderr-infinite] * fpc
fpc = SQRT(([population size] - [sample size]) / ([population size] - 1))
So, we simply adjust the standard error and continue with whatever analysis we are using.
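As a quick sketch of the arithmetic, applying the correction to the sock example (the helper function is mine, just wrapping the formulas above):

from math import sqrt

def corrected_stderr(p, sample_size, population_size):
    # Standard error of a proportion with the finite population correction.
    stderr_infinite = sqrt(p * (1 - p) / sample_size)
    fpc = sqrt((population_size - sample_size) / (population_size - 1))
    return stderr_infinite * fpc

# Eight socks drawn, half of them black.
print(corrected_stderr(0.5, 8, 10**6))  # ~0.177, essentially the infinite-drawer case
print(corrected_stderr(0.5, 8, 10))     # ~0.083, much tighter when only 10 socks exist
print(corrected_stderr(0.5, 8, 8))      # 0.0, the sample is the whole drawer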
There is one caveat to this approach. When the observed proportion is 0% or 100%, the standard error will always be 0, even with the correction. In this case, we need a better estimate. In practice, I add or subtract 0.5 from the proportion to calculate the standard error.
This problem is definitely not limited to physicians and medical therapies. I think it becomes an issue in many circumstances where we want to project a global number onto smaller entities.
So, an insurance company may investigate cases for fraud. Overall, they have a large number of cases, but only 5%-10% are in the investigation. If they want to use this information to understand fraud at the agent level, then some agents will have 1% investigated and some 20%. For many of these agents, the correction factor is needed to understand our confidence in their customers' behavior.
The problem occurs because the assumption of an infinite population is reasonable over everyone. However, when we break it into smaller groups (physicians or agents), then the assumption may no longer be valid.
Saturday, May 3, 2008
Adjusting for oversampling
A few days ago, a reader of this blog used the "ask a data miner" link on the right to mail us this question. (Or, these questions.)
Question:
When modeling rare events in marketing, it has been suggested by many to take a sample stratified by the dependent variable(s) in order to allow the modeling technique a better chance of detecting a difference (or differences in the case of k-level targets). The guidance for the proportion of the event in the sample seems to range between 15-50% for a binary outcome (no guidance have I seen for a k-level target). I am confused by this oversampling and have a couple questions I am hoping you can help with:
- Is there a formula for adjusting the predicted probability of the event (s) when there has been oversampling on the dependent?
- Is this correction only needed when ascertaining the lift of the model or comparing to other models which were trained on a dataset with the same oversampling proportion OR would you need the adjustment to be done to the predicted value when you score a new dataset - such as when you train a model on a previous campaign and use the model to score the candidates for the upcoming campaign?
- I use logistic regression and decision trees for classification of categorical dependent variables - does the adjustment from (question 1) apply to both of these? Does the answer to (question 2) also apply to both of these techniques?
Especially with decision tree models, we suggest using stratified sampling to construct a model set with approximately equal numbers of each of the outcome classes. In the most common case, there are two classes and one, the one of interest, is much rarer than the other. People rarely respond to direct mail; loan recipients rarely default; health care providers rarely commit fraud. The difficulty with rare classes is that decision tree algorithms keep splitting the model set into smaller and smaller groups of records that are purer and purer. When one class is very rare, the data passes the purity test before any splits are made. The resulting model always predicts the common outcome. If only 1% of claims are fraudulent, a model that says no claims are fraudulent will be correct 99% of the time! It will also be useless. Creating a balanced model set where half the cases are fraud forces the algorithm to work harder to differentiate between the two classes.
To answer the first question, yes, there is a formula for adjusting the predicted probability produced on the oversampled data to get the predicted probability on data with the true distribution of classes in the population. Suppose there is 1% fraud in the population and 50% fraud in the model set. Each example of fraud in the model set represents a single actual case of fraud in the population, while each non-fraud case in the model set represents 99 non-fraud cases in the population. We say the oversampling rate is 99. So, if a certain leaf in a decision tree built on the balanced data has 95 fraudulent cases and 5 non-fraudulent cases, the actual probability of fraud predicted by that leaf is 95/(95+5*99), or about 0.16, because each of the 5 non-fraudulent cases represents 99 such cases. We discuss this at length in Chapter 7 of our book, Mastering Data Mining. You can also arrive at this result by applying the model to the original, non-oversampled data and simply counting the number of records of each class found at each leaf. This is sometimes called backfitting the model.
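As a small sketch of that arithmetic (the function and its name are mine, not from any particular tool):

def adjusted_probability(rare_count, common_count, oversampling_rate):
    # Probability of the rare class at a leaf, after weighting each
    # common-class record by the oversampling rate.
    return rare_count / (rare_count + common_count * oversampling_rate)

# The leaf described above: 95 fraudulent and 5 non-fraudulent records,
# with 1% fraud in the population balanced to 50% in the model set.
print(adjusted_probability(95, 5, 99))   # ~0.16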
To answer the second question, this calculation is only necessary if you are actually trying to estimate the probability of the classes. If all you want to do is generate scores that can be used to rank order a list or compare lift for several models all built from the oversampled data, there is no need to correct for oversampling because the order of the results will not change.
Using oversampled data also changes the results of logistic regression models, but in a more complicated way. As it happens, this is a particular interest of Professor Gary King, who taught the only actual class in statistics that I have ever taken. He has written several papers on the subject.