Saturday, October 20, 2007

Marginal Value Models: Explanation

This posting describes a very simple type of model used when the target of the model is numeric and all the inputs are categorical variables. This posting explains the concepts behind the models. The next posting has the code associated with them.

I call these models marginal value models. In statistics, the term "marginal" means that we are looking at only one variable at a time. Marginal value models calculate the contribution from each variable, and then combine the results into an expected value.

The chi-square test operates in a similar fashion, but takes the process one step further. The chi-square test compares the actual value to the expected value to determine whether they are sufficiently close to be due to small random variations -- or far enough apart to be suspicious. Both marginal value models and the chi-square test are discussed in more detail in my most recent book Data Analysis Using SQL and Excel. Here the emphasis is a bit different; the focus is on implementing this type of model as an extension to Excel.


What are the Marginal Values?

For the purposes of this discussion, the marginal values are the values summarized along one of the dimensions. For instance, if we are interested in the population of different parts of the United States, we might have the population for each state. The following query summarizes this information based on a table of zip code summaries (available on the companion web site to "Data Analysis Using SQL and Excel"):

SELECT state, AVG(medincome), SUM(population)
FROM zipcensus
GROUP BY state

The resulting histogram shows the distribution along this dimension:

The exact values are not known. What if we also know the population for urban, rural, and mixed areas of the country? These might have the following values:

MIXED 148,595,327
RURAL 27,240,454
URBAN 109,350,778

Given this information about population along two dimensions, how can we combine the information to estimate, say, the rural population of New York?

Those familiar with the chi-square test recognize this as the question of the expected value. In this situation, the expected value is the total population of the state times the total population of the area category divided by the total population in the United States. That is, it is the row total times the column total divided by the total.

For rural Alabama, this results in the following calculation: 4,446,124*27,240,454/285,186,559 = 424,685. This provides an estimate calculated by combining the information summarized along each dimension.
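
The rural Alabama calculation can be sketched in a few lines of Python. The function name expected_count is an illustrative assumption, not something taken from the book or its companion code; the three totals are the figures quoted in this posting.

```python
def expected_count(row_total, column_total, grand_total):
    """Expected cell count: row total times column total divided by the total."""
    return row_total * column_total / grand_total

# Figures quoted in the posting.
alabama_population = 4_446_124
rural_population = 27_240_454
us_population = 285_186_559

rural_alabama = expected_count(alabama_population, rural_population, us_population)
print(round(rural_alabama))  # 424685
```

The result is only as good as the independence assumption behind it; the function simply reproduces the row-times-column-over-total arithmetic.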

Is this estimate accurate? That is quite another question. If the two dimensions are statistically independent, then the estimate is quite accurate. If there is an interaction effect, then the estimate is not accurate. However, if all we have are summaries along the dimensions, then this might be the best that we can do.


Combining Values Along More Than Two Marginal Dimensions

The formula for the expected value can be easily extended to multiple dimensions. The idea is to multiply ratios rather than counts. The two-dimension case can be thought of as the product of the following three numbers:

  • The proportion of the population along dimension 1.
  • The proportion of the population along dimension 2.
  • The total population.

That is, we are multiplying proportions (or probabilities, if you prefer). The idea is that the "probability" of being in Alabama is the population of Alabama divided by the population of the country. The "probability" of being rural is the rural population divided by the population of the country. The "probability" of both is the product. To get the count, we multiply the "joint probability" by the population of the country.

This is easily extended to multiple dimensions. The overall "probability" is the product of the "probabilities" along each dimension. To get the count, we then have to multiply by the overall population. Mathematically, the idea is to combine the distributions along each dimension, assuming statistical independence. The term "probability" appears in quotes -- it is almost a philosophical question whether "probabilities" are the same as "proportions", and that is not the subject of this posting.
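
As a minimal sketch of the multi-dimension version, the Python below multiplies the proportion along each dimension and then scales by the overall total. The function name and the third marginal total are assumptions made for illustration: the posting only supplies state and area-type figures, so the extra dimension uses a made-up number.

```python
from math import prod

def expected_count(marginal_totals, grand_total):
    """Expected cell count, assuming the dimensions are independent.

    marginal_totals holds one total per dimension (for example, the state
    total and the area-type total for the cell of interest).
    """
    proportions = [total / grand_total for total in marginal_totals]
    return grand_total * prod(proportions)

us_population = 285_186_559

# Two dimensions: Alabama and RURAL, using the figures quoted in the posting.
print(round(expected_count([4_446_124, 27_240_454], us_population)))

# Three dimensions: the third total is a made-up number for a hypothetical
# extra dimension (say, an age band), included only to show the extension.
hypothetical_age_band_total = 71_296_640
print(round(expected_count(
    [4_446_124, 27_240_454, hypothetical_age_band_total], us_population)))
```

With two dimensions this reduces to the row total times the column total divided by the overall total, so it agrees with the two-dimension formula.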

This formulation of the problem is quite similar to naive Bayesian models. The only difference is that here we are working with counts and naive Bayesian models work with ratios. I will return to naive Bayesian models in later postings.


Combining Things That Aren't Counts

Certain things are not counts, but can be treated as counts for the purpose of calculating expected values. The key idea is that the overall totals must be the same (or at least quite close).

For example, the census data contains the proportion of the population that has some college degree. What if we wanted to estimate this proportion for the urban population in New York?
What we need for the marginal value model to work is simply the ability to count things up along the dimensions. In this case, we are tempted to count the proportion of the population of interest (since that is the data we have and what the question ultimately asks for).

However, we cannot use proportions because they do not "add up" to the same total numbers along each dimension. This means that if we take the sum of the proportions in each state the total will be quite different than the sum of the proportions for urban, rural, and mixed. If for no other reason, adding up fifty numbers (or so) is unlikely to produce the same result as adding up three.

Fortunately, there is a simple solution. Multiply the proportion by the appropriate population in each group, to get the number of college educated people in each group. This number adds up appropriately along each dimension, so we can use it in the formulas described above.

In the end, we get the number of people in, say, rural Alabama who have a college education. We can then divide by the estimate for the population, and arrive at an answer to the question.
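
The following Python sketch walks through those steps on a tiny example. All of the numbers, the state names "A" and "B", and the helper name expected are made up for illustration; they are not census figures. They were chosen so that the college-educated counts add up to the same total along both dimensions, which is the condition the method needs.

```python
# Hypothetical marginal summaries (made-up numbers, not census figures).
state_population = {"A": 600, "B": 400}
area_population = {"URBAN": 700, "RURAL": 300}
state_college_share = {"A": 0.30, "B": 0.20}
area_college_share = {"URBAN": 0.32, "RURAL": 0.12}

# Step 1: turn proportions into counts, which add up along each dimension.
state_college = {s: state_population[s] * state_college_share[s]
                 for s in state_population}
area_college = {a: area_population[a] * area_college_share[a]
                for a in area_population}

total_population = sum(state_population.values())
total_college = sum(state_college.values())  # about 260 along either dimension

def expected(row_total, column_total, grand_total):
    """Row total times column total divided by the overall total."""
    return row_total * column_total / grand_total

# Step 2: expected college-educated count and expected population for rural A.
college_rural_a = expected(state_college["A"], area_college["RURAL"], total_college)
population_rural_a = expected(state_population["A"], area_population["RURAL"],
                              total_population)

# Step 3: divide to get back to a proportion.
share_rural_a = college_rural_a / population_rural_a
print(f"{share_rural_a:.3f}")
```

Note that the college-educated counts are combined using the college-educated total, while the population counts are combined using the population total; each quantity is scaled by its own overall total.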

This method works with other numbers of interest, such as the average income. The idea would be to multiply the average income times the population to get dollars. Dollars then add up along the dimensions, and we can calculate the appropriate values in each group.

Chi-Square Test

The final topic in this posting is the calculation of the chi-square value using the marginal value model. The chi-square value is simply:

chi-square value = sum((actual - expected)^2/expected)

The value can be used as a measure of how close the observed data are to the expected values. In other words, it is a measure of how statistically independent the dimensions are. Higher values suggest interdependencies. Values closer to 0 suggest that the dimensions are independent.
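
To make the formula concrete, here is a small Python sketch that computes the chi-square value for a two-dimensional table of counts, deriving each expected value from the row and column totals. The function name chi_square and both example tables are made up for illustration, and the sketch assumes no row or column total is zero.

```python
def chi_square(actual):
    """Chi-square value for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in actual]
    column_totals = [sum(column) for column in zip(*actual)]
    grand_total = sum(row_totals)
    value = 0.0
    for i, row in enumerate(actual):
        for j, observed in enumerate(row):
            # Expected value from the marginal value model.
            expected = row_totals[i] * column_totals[j] / grand_total
            value += (observed - expected) ** 2 / expected
    return value

# Made-up tables with the same row and column totals.
independent = [[20, 30], [40, 60]]  # every cell equals its expected value
dependent = [[40, 10], [20, 80]]    # counts pile up on the diagonal

print(chi_square(independent))  # 0.0
print(chi_square(dependent))    # much larger, about 50
```

Both tables share the same marginal totals, so the expected values are identical; only the actual counts differ, and that difference is what the chi-square value picks up.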

This posting describes the background for marginal value models. The next posting describes how to add them into SQL Server.
