Thursday, June 5, 2008

Qualifications for Studying Data Mining

A recent question . . .

I am hoping to begin my masters degree in Data Mining. I have come from a Software Development primary degree. I am a bit worried over the math involved in Data Mining.Could you tell me, do I need to have a strong mathematical aptitude to produce a good Thesis on Data Mining?

First, I think a software development background is a good foundation for data mining. Data mining is as much about data (and hence computers and databases) as it is about analysis (and hence statistics, probability, and math).

Michael and I are not academics so we cannot speak to the thesis requirements for a particular data mining program. Both of us majored in mathematics (many years ago) and then worked as software engineers. We do have some knowledge of both fields, and the combination provided a good foundation for our data mining work.

To be successful in data mining, you do need some familiarity with math, particularly applied math -- things like practical applications of probability, algebra, the ability to solve word problems, and the ability to use spreadsheets. Unlike theoretical statistics, the purpose of data mining is not to generate rigorous proofs of various theorems; the purpose is to find useful patterns in data, to validate hypotheses, to set up marketing tests. We need to know when patterns are unexpected, and when patterns are expected.

This is a good place to add a plug for my book Data Analysis Using SQL and Excel, which has two or three chapters devoted to practical statistics in the context of data analysis.

In short, if you are math-phobic, then you might want to reconsider data mining. If your challenges in math are solving complex integrals, then you don't have much to worry about.

--gordon

Labels: , ,

Saturday, May 17, 2008

The Agent Problem: Sampling From A Finite Population

A drawer is filled with socks and you remove eight of them randomly. Four are black and four are white. How confident are you in estimating the proportion of white and black socks in the drawer?

The standard statistical approach is to assume that the number of socks in the drawer is infinite, and to use the formula for the standard error of a proportion: SQRT([proportion] * [(1 - [proportion])/[number taken out]) or, more simply, SQRT(p*q/n). In this case, the standard error is SQRT(0.5*0.5/8) = 17.7%

However, this approach clearly does not work in all cases. For instance, if there are exactly eight socks in the drawer, then the sample consists of all of them. We are 100% sure that the proportion is exactly 50%.

If there are ten socks in the drawer, then the proportion of black socks ranges from 4/10 to 6/10. These extremes are within one standard error of the observed average. Or to phrase it differently, any reasonable confidence interval (80%, 90%, 95%) contains all possible values. The confidence interval is wider than what is possible.

What does this have to do with business problems? I encountered essentially the same situation when looking at the longitudinal behavior of patients visiting physicians. I had a sample of patients who had visited the physicians and was measuring the use of a particular therapy for a particular diagnosis. Overall, about 20-30% of all patients where in the longitudinal data. And, I had pretty good estimates of the number of diagnoses for each physician.

There are several reasons why this is important. For the company that provides the therapy, knowing which physicians are using it is important. In addition, if the company does any marketing efforts, they would like to see how they perform. So, the critical question is: how well does the observed patient data characterize the physician behavior.

This is very similar to the question posed earlier. If the patient data contains eight new diagnoises and four start on the therapy of interest, how confident am I that the doctor is starting 50% of new patients on the therapy?

If there are eight patients in total, then I am 100% confident, since all of them managed to be in my sample. On the other hand, if the physician has 200 patients, then the statistical measures of standard error are more appropriate.

The situation is exacerbated by another problem. Although the longitudinal data contains 20%-30% of all patients, the distribution over the physicians is much wider. Some physicians have 10% of their patients in the data and some have 50% or more.

The solution is actually quite simple, but not normally taught in early statistics or business statistics courses. There is something called the finite population correction for exactly this situation.

[stderr-finite] = [stderr-infinite]*fpc
fpc = SQRT(([population size]- [sample size])/([population size] - 1))

So, we simply adjust the standard error and continue with whatever analysis we are using.

There is one caveat to this approach. When observed proportion is 0% or 100%, then the standard error will always be 0, even with the correction. In this case, we need to have a better estimate. In practice, I add or subtract 0.5 from the proportion to calculate the standard error.

This problem is definitely not limited to physicians and medical therapies. I think it becomes an issue in many circumstances where we want to project a global number onto smaller entities.

So, an insurance company may investigate cases for fraud. Overall, they have a large number of cases, but only 5%-10% are in the investigation. If they want to use this information to understand fraud at the agent level, then some agents will have 1% investigated and some 20%. For many of these agents, the correction factor is needed to understand our confidence in their customers' behavior.

The problem occurs because the assumption of an infinite population is reasonable over everyone. However, when we break it into smaller groups (physicians or agents), then the assumption may no longer be valid.

Labels: ,

Monday, April 21, 2008

Using SET with Unique to Join Tables in SAS Data Steps

Recently, I have had to write a bunch of SAS code for one of our clients. Although I strive to do as much as possible using proc sql, there are some things that just require a data step. Alas.

When using the data step, I wish I were able call a query directly:

data whereever;
....set (SELECT beautiful things using SQL syntax);
....and so on with the SAS code

However, this is not possible.

A SAS programmer might point out that there are two easy work-arounds. First, you can simply call the query and save it as a SAS data set. Alternatively, you can define a view and access the view from the data step.

I do not like either of these solutions. One reason why I like SQL is that I can combine many different parts of a solution into a single SQL statement -- my SQL queries usually have lots of subqueries. Another reason I like SQL is it reduces the need for clutter -- intermediate files/tables/data sets -- which need to be named and tracked and managed and eventually deleted. I ran out of clever names for such things about fifteen years ago and much prefer having the database do the dirty work of tracking such things. Perhaps this is why I wrote a book on using SQL for data analysis.

So, I give up on the SQL syntax, but I still want to be able to do similar processing. The data step does make it possible to do joins, using a syntax that is almost intuitive (at least for data step code). The advertised syntax looks like:

proc sql;
....create index lookupkey on lookup;

data whereever;
....set master;
....set lookup (keep=lookupkey lookupvalue) key=lookupkey;
....and so on with the SAS code

This example is highly misleading! (So look below for a better version.) But, before explaining the problems and the solution, let me explain how the code works.

The first statement is a proc sql statement that builds an index on the lookup data set using the lookup key column. Real SAS programmers might prefer proc datasets, but I'm not a real SAS programmer. They do the same thing.

The second statement is the data step. The key part of the data step is the second set statement which uses the key= keyword. This keyword says to look up the corresponding value in another data set and fetch the first row where the values match. The "key" itself is an index, which is why I created the index first.

The keep part of the statement is just for efficiency's sake. This says to only keep the two variables that I want, the lookup key (which is needed for the index) and the lookup value. There may be another two hundred columns in the lookup table (er, data set), but these are the only ones that I want.

This basic example is quite deceptive. Indexes in SAS are a lot like indexes in databases. They are both called indexes and both give fast access to rows in a table, based on values in one or more columns. Both can be created in SQL.

However, they are not the same. The above syntax does work under some circumstances, such as when all the lookup keys are in the lookup table and when no two rows in a row in the master table have the same key (or some strange condition like that). Most importantly, I've found that the syntax seems to work on small test data but not on larger sets. This is a most nefarious type of difference. And, there are no warnings or errors.

The problem is that SAS indexes allow duplicates but treat indexes with duplicate keys differently from indexes with unique keys. Even worse, SAS determines this by how the index is created, not by the context. And for me (the database guy) the most frustrating thing is that the default is for the strange behavior instead of the nice clean behavior I'm expecting. I freely admit a bias here.

So we have to explicitly say that the index has no duplicates. In addition, SAS does not have reasonable behavior when there is no match. "Reasonable" behavior would be to set the lookup value to missing and to continue dutifully processing data. Instead, SAS generates an error and puts garbage in the lookup value.

proc sql;
....create unique index lookupkey on lookup;

data whereever;
....set master;
....set lookup (keep=lookupkey lookupvalue) key=lookupkey/unique;
....if (_iorc_ = %sysrc(_dsenom)) then do;
........_ERROR_ = 0;
........lookupvalue = .;
....end;
....and so on with the SAS code

The important change the presence of the unique signifier in both the create index statement and in the set statement. I have found that having it in one place is not sufficient, even when the index actually has no duplicates.

The error handling also tro ubles me. Strange functions called "iorc" are bad enough, even without being preceding by an underscore. Accessing global symbols such as _ERROR should be a sign that something extraordinary is going on. But nothing unusual is happening; the code is just taking into account the fact that the key is not in the lookup table.

In the end, I can use the data step to mimic SQL joins, including left outer joins (by taking into account, by using appropriate indexes and keys. Although I don't particularly like the syntax, I do find this capability very, very useful. The data step I referred to at the beginning of this post has eleven such lookups, and many of the lookup tables have hundreds of thousands or millions of rows.

--gordon

Labels: ,