Friday, November 30, 2007

Naive Bayesian Models (Part 1)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis Using SQL and Excel. The first post is available here.]


The previous posts have shown how to extend SQL Server to support some basic modeling capabilities. This post and the next post add a new type of model, the naive Bayesian model, which is actually quite similar to the marginal value model discussed earlier.

This post explains some of the mathematics behind the model. A more thorough discussion is available in my book Data Analysis Using SQL and Excel.

What Does A Naive Bayesian Model Do?
A naive Bayesian model calculates a probability by combining summary information along different dimensions.

This is perhaps best illustrated by an example. Say that we have a business where 55% of customers survive for the first year. Say that male customers have a 60% probability of remaining a customer after one year and that California customers have an 80% probability. What is the probability that a male customer from California will survive the first year?

The first thing to note is that the question has no single correct answer given only this information; perhaps men in California are quite different from men elsewhere. The true answer could be any number between 0% and 100%.

The second thing to note is the structure of the problem. We are looking for a probability for the intersection of two dimensions (gender and state). To solve this, we have:
  • The overall probability for the population (55%).
  • The probability along each dimension (60% and 80%).
The naive Bayesian model combines this information by making an assumption (which may or may not be true). In this case, the answer is that a male from California has an 83.1% probability of surviving the first year.

The naive Bayesian model can handle any number of dimensions. However, it is always calculating a probability using information about the probabilities along each dimension individually.

Probabilities and Likelihoods
The value of 83.1% may seem surprising. Many people's intuition would put the number between 60% and 80%. Another way of looking at the problem, though, might make this clearer. Being male makes a customer more likely than average to stay for a year. Being from California makes a customer even more likely to stay. Combining the information on the two dimensions should give a stronger result than either dimension individually.

It is one thing to explain this in words. Modeling and data mining require explaining things with formulas. The problem is about probabilities, but the solution uses a related concept: the likelihood.

The likelihood has a simple formula: likelihood = p / (1-p), where p is the probability. That is, it is the ratio of the probability of something happening to the probability of its not happening. Where the probability varies from 0% to 100%, the likelihood varies from zero to infinity. Also, given a likelihood, the probability is easily calculated: p = 1 - (1/(1+likelihood)).

The likelihood is also known as the odds. When we say something has 1 to 9 odds, we mean that it happens one time for every nine times it does not happen. Another way of saying this is that the probability is 10%.

For instance, the following are the likelihoods for the simple problem being discussed:
  • overall likelihood (p = 55%) = 1.22;
  • male likelihood (p = 60%) = 1.50; and,
  • California likelihood (p = 80%) = 4.00.
Notice that the likelihoods vary more dramatically than the probabilities. That is, 80% is just a bit more than 60%, but 4.0 is much larger than 1.5.
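
The conversions in this section can be sketched in a few lines of Python. This is only an illustration, not part of the SQL Server code in this series; the function names likelihood and probability are made up for the example.

```python
def likelihood(p: float) -> float:
    """Convert a probability (0 <= p < 1) into a likelihood (odds)."""
    return p / (1.0 - p)


def probability(likelihood_value: float) -> float:
    """Convert a likelihood (odds) back into a probability."""
    return 1.0 - 1.0 / (1.0 + likelihood_value)


# The three probabilities from the example: overall, male, California.
for label, p in [("overall", 0.55), ("male", 0.60), ("California", 0.80)]:
    print(f"{label}: p = {p:.0%}, likelihood = {likelihood(p):.2f}")

# Odds of 1 to 9 correspond to a probability of 10%.
print(f"1 to 9 odds: p = {probability(1 / 9):.0%}")
```

Converting a probability to a likelihood and back returns the original value, which is a quick check that the two formulas agree.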

The Naive Bayesian Formula
The formula for the naive Bayesian model uses one more concept, the likelihood ratio. This is the ratio of any given likelihood to the overall likelihood. This ratio also varies from zero to infinity. When the likelihood ratio is greater than one, the outcome is more likely for that group than for the population as a whole (as is the case for both males and Californians).

The formula for the naive Bayesian model says the following: the likelihood of something occurring for a combination of values along multiple dimensions is the overall likelihood times the likelihood ratio along each dimension.

For the example, the formula produces: 1.22*(1.5/1.22)*(4.0/1.22)=4.91. When converted back to a probability this produces 83.1%.
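
As a sanity check, here is a minimal Python sketch of the same calculation. The function name naive_bayes_probability and its parameters are invented for this illustration; the only inputs are the three probabilities from the example.

```python
from typing import Iterable


def odds(p: float) -> float:
    """Likelihood (odds) for a probability p."""
    return p / (1.0 - p)


def naive_bayes_probability(overall_p: float, dimension_ps: Iterable[float]) -> float:
    """Combine per-dimension probabilities using the naive Bayesian formula."""
    overall = odds(overall_p)
    combined = overall
    for p in dimension_ps:
        # Multiply by the likelihood ratio for this dimension.
        combined *= odds(p) / overall
    # Convert the combined likelihood back to a probability.
    return combined / (1.0 + combined)


# Overall 55%, male 60%, California 80%.
result = naive_bayes_probability(0.55, [0.60, 0.80])
print(f"{result:.1%}")  # prints 83.1%
```

Working with unrounded likelihoods gives a combined likelihood of about 4.909, which agrees with the 4.91 above to two decimal places.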

What Does the Naive Assumption Really Mean?
The "Bayesian" in "naive Bayesian" refers to a basic probability formula devised by Rev. Thomas Bayes in the 1700s (it was published in 1763, after his death). The formula described above is derived from it.

The "naive" in naive Bayesian refers to a simple assumption: that the information along the two dimensions is independent. (Strictly speaking, the dimensions are assumed to be independent both among customers who survive and among those who do not.) This is essentially the same assumption that we made for the marginal value model. In fact, the two models are very similar. Both combine information along dimensions into a single value. For the marginal value model, that information is counts; for the naive Bayesian model, it is probabilities.
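
For readers who want to see where the naive assumption enters, here is one way to sketch the derivation. The notation is mine, not from the book: S means "survives the first year", \bar{S} means "does not survive", and A and B are the values along the two dimensions (male and California in the example).

```latex
\begin{align*}
\frac{P(S \mid A, B)}{P(\bar{S} \mid A, B)}
  &= \frac{P(S)}{P(\bar{S})} \cdot \frac{P(A, B \mid S)}{P(A, B \mid \bar{S})}
  && \text{(Bayes' formula)} \\
  &= \frac{P(S)}{P(\bar{S})} \cdot \frac{P(A \mid S)}{P(A \mid \bar{S})}
     \cdot \frac{P(B \mid S)}{P(B \mid \bar{S})}
  && \text{(naive assumption)} \\[6pt]
\text{where}\quad
\frac{P(A \mid S)}{P(A \mid \bar{S})}
  &= \frac{P(S \mid A) \,/\, P(\bar{S} \mid A)}{P(S) \,/\, P(\bar{S})}
  && \text{(the likelihood ratio for } A\text{)}
\end{align*}
```

The first factor is the overall likelihood, and each remaining factor is the likelihood ratio along one dimension, which is the formula given earlier in words.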

In the real world, it is unusual to find dimensions that are truly independent. However, the naive Bayesian approach can still work well in practice. Often, we do not need the actual probabilities. It is sufficient to have relative measures (males from California are better risks than females from Nevada, for instance).

If we further analyzed the data or did a test and learned that males from California really survived at only a 40% rate instead of 83.1%, then this fact would be evidence that state and gender are not independent. The solution is simply to replace state and gender with a single dimension that combines the two: California-male, California-female, Nevada-male, and so on.

One of the nice features of these models is that they can use a large number of features of the data and readily handle missing information (the likelihood ratio for a dimension that is missing is simply left out of the equation). This makes them feasible for applications such as classifying text, where other techniques do not work so well. The naive assumption also makes it possible to calculate a probability for a combination of dimensions that has never been seen before.
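
The handling of missing values can be sketched as follows. This is a hypothetical Python illustration, not the SQL Server code from the next post; the names RATES, OVERALL_P, and score are invented, and the lookup table holds only the two rates given in the example.

```python
from typing import Dict, Optional

OVERALL_P = 0.55  # overall first-year survival rate from the example

# Survival rates along each dimension; only the values from the example.
RATES: Dict[str, Dict[str, float]] = {
    "gender": {"male": 0.60},
    "state": {"California": 0.80},
}


def odds(p: float) -> float:
    """Likelihood (odds) for a probability p."""
    return p / (1.0 - p)


def score(customer: Dict[str, Optional[str]]) -> float:
    """Estimate survival probability, skipping missing or unknown values."""
    overall = odds(OVERALL_P)
    combined = overall
    for dimension, value in customer.items():
        p = RATES.get(dimension, {}).get(value)
        if p is None:
            continue  # missing value: leave this likelihood ratio out
        combined *= odds(p) / overall
    return combined / (1.0 + combined)


print(f"{score({'gender': 'male', 'state': 'California'}):.1%}")
print(f"{score({'gender': 'male', 'state': None}):.1%}")
```

When the state is missing, the estimate falls back to the 60% rate for males alone, and with no known dimensions at all it falls back to the overall 55%.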

The next posting contains the code for a basic naive Bayesian model in SQL Server.
