## Tuesday, September 15, 2009

If you don´t mind, I would like to ask you a Question regarding Oversampling as you wrote in your book (Mastering Data Mining...).

I can understand how you calculate predictive lift when using oversampling, though don´t know how to do it for the confusion matrix.

Would you mind telling me how do I compute then the confusion matrix for the actual population (not the oversampled set)?

Best,
Diego

Gentlemen-

I have severely unbalanced training data (180K negative cases, 430 positive cases). Yeah...very unbalanced.

I fit a model in a software program that allows instance weights (weka). I give all the positive cases a weight of 1 and all the negative cases a weight of 0.0024. I fit a model (not a decision tree so running the data through a test set is not an option to recalibrate) - like a neural network. I output the probabilities and they are out of whack - good for predicting the class or ranking but not for comparing predicted probability against actual.

What can we do to fit a model like this but then output probabilities that are in line with the distribution? Is this new (wrong) probabilities just the price we have to pay for instance weights to (1) get a model to build (2) get reasonably good classification? Can I have my cake and eat it too (classification and probs that are close to actual)?

Many many thanks!
Brian

The problem in these cases is the same. The goal is to predict a class, usually a binary class, where one outcome is rarer than the other. To generate the best model, some method of oversampling is used so the model set has equal numbers of the two outcomes. There are two common ways of doing this. Diego is probably using all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tools run faster. Brian is using a method where weights are used for the same purpose. Rare cases are given a weight of 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.

Regardless of the technique (neural network, decision trees, logistic regression, neearest neighbor, and so on), the resulting probabilities are "directionally" correct. A group of rows with a larger probability are more likey to have the modeled outcome than a group with a lower probability. This is useful for some purposes, such as getting the top 10% with the highest scores. It is not useful for other purposes, where the actual probability is needed.

Some tools can back into the desired probabilities, and do correct calculations for lift and for the confusion matrix. I think SAS Enterprise Miner, for instance, uses prior probabilties for this purpose. I say "think" because I do not actually use this feature. When I need to do this calculation, I do it manually, because not all tools support it. And, even if they do, why bother learning how. I can easily do the necessary calculations in Excel.

The key idea here is simply counting. Assume that we start with data that is 10% rare and 90% common, and we oversample so it is 50%-50%. The relationship between the original data and the model set is:
• rare outcomes: 10% --> 50%
• common outcomes: 90% --> 50%
To put it differently, each rare outcome in the original data is worth 5 in the model set. Each common outcome is worth 5/9 in the model set. We can call these numbers the oversampling rates for each of the outcomes.

We now apply these mappings to the results. Let's answer Brian's question for a particular situation. Say we have the above data and a result has a modeled probability of 80%. What is the actual probability?

Well, 25% means that there is 0.25 rare outcomes for 0.75 common ones. Let's undo the mapping above:
• 0.80 / 5 = 0.16
• 0.20 / (5/9) = 0.36
So, the expected probability on the original data is 0.16/(0.16+0.36) = 30.8%. Notice that the probability has decreased, but it is still larger than the 10% in the original data. Also notice that the lift on the model set is 80%/50% = 1.6. The lift on the original data is 3.08 (30.8% / 10%). The expected probability goes down, and the lift goes up.

This calculation can also be used for the cross-correlation matrix (or confusion matrix). In this case, you just have to divide each cell by the appropriate overampling rate. So, if the confusion matrix said:
• 10 rows in the model set are rare and classified as rare
• 5 rows in the model set are rare and classified as common
• 3 rows in the model set are common and classified as rare
• 12 rows in the model set are common and classified as common
(I apologize for not including a table, but that is more trouble than it is worth in the blog.)

In the original data, this means:
• 2=10/5 rows in the original data are rare and classified as rare
• 1=5/5 rows in the original data are rare and classified as common
• 5.4 = 3/(5/9) rows inthe original data are common and classified as rare
• 21.6 = 12/(5/9) rows in the original data are common and classified as common
These calculations are quite simple, and it is easy to set up a spreadsheet to do them.

I should also mention that this method readily works for any number of classes. Having two classes is simply the most common case.

Labels: , ,

Jeff Allard said...

Good Post!

I have also used the technique of Platt successfully in a case of weighted re-balancing of the training set (basically using logistic regression to map probabilities in the oversampled space to those of the actual).

Here is a paper for the interested:
http://lkm.fri.uni-lj.si/xaigor/eng/scipaper/NiculescuMizilCaruana05-icml.pdf

Jeff

September 18, 2009 8:21 AM
Angelica said...

Gentlemen,

I have been working using over and down sampling for really unbalanced data but reading Brian's question I'm wondering. Is it better to use over sampling or use the weights he mentions? Do you have papers that compare that two techniques?

September 21, 2009 10:43 AM