### Marginal Value Models: Building and Using Data Mining Models in SQL Server (Part 1)

A marginal value model is a very simple type of model. However, it gives a good example of how to implement a data mining model using SQL Server extensions. Recall that the model itself produces an expected value, in the same way as the chi-square calculation. This expected value is the model estimate.

The inputs to the model are the dimensions for the estimate, and these are necessarily categorical variables (strings) that take on, preferably, just a handful of values. As a note, the version described here has limits on the total number of values along all dimensions; these limits are imposed by SQL Server and discussed at the end of this post.

This post describes how to use the model. The next post describes how it is implemented. The third post describes some additional useful technical details.

This posting has three files attached:

As explained in the first posting, the first two files are a DLL containing the functionality and a T-SQL script that loads it into the database. The third file contains the code.

**BasicMarginalValueModel Type**

The marginal value model itself is a data type in SQL Server. The data type BasicMaringalValueModel is implemented as a C# class containing all the information that describes the model as well as various functions, such as:

- ToString(), which converts the information describing the model to a string.
- Parse(), which parses a string containing information describing the model.
- Write(), which is like ToString() except the format is binary instead of character.
- Read(), which is like Parse() except the format is binary instead of character.

What does the model "look" like? It looks like pairs of values. So, if the model contained 50 states and three region types ("urban", "rural", and "mixed"), then an instance of the model would contain up to 53 key-value pairs. The key combines the dimension number (0 for state, 1 for region type) and dimension value (state or region type). The second has the value associated with it.

**An Aggregation, Another Type, and a Function**

Having a type is useful, but how do we create values of the type? The answer is simple, the CreateBasicMarginalValueModel aggregation function. This aggregation adds up the counts on all dimensions.

Ideally, the aggregation function could be called as:

SELECT ud.dbo.CreateMarginalValueModel(dim1, dim2, . . ., value)

FROM t

However, this is not possible, because aggregation functions can only take one argument. The data type MarginalValueModelArgs stores one or more dimensions along with a value. The value would typically be 1; however, it is also possible to create the models on summarized or partially summarized data.

**What Creating a Model Looks Like**

The following code shows one way to create a model using state and region type as two dimensions:

FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).

...................(CASE WHEN purban = 1 THEN 'urban'

.........................WHEN purban = 0 THEN 'rural'

.........................ELSE 'mixed' END) as regtype

............FROM sqlbook..zipcensus zc) zc) zc

DECLARE @model BasicMarginalValueModel

..................... AddDim(regtype) as arg

**Scoring a Model**

SELECT @model.Score(arg) FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).

.................AddDim(regtype) as arg

......FROM (SELECT zc.*,

...................(CASE WHEN purban = 1 THEN 'urban' .........................WHEN purban = 0 THEN 'rural' .........................ELSE 'mixed' END) as regtype ............FROM sqlbook..zipcensus zc) zc) zc

That is, the dimensions are bundled together into MarginalValueArgs and passed to the model for scoring.

The model can also be used to calculate the chi-squared value (which is probably the most useful thing to do with such a model). This is simply another function in the BasicMarginalValueModel.

SELECT @model.ChiSquared(arg)

FROM (SELECT state, regtype, count(*) as cnt, .............ud.dbo.MarginalValueArgs1(state, count(*)).

.......................................AddDim(regtype) as arg ......FROM (SELECT zc.*,

...................(CASE WHEN purban = 1 THEN 'urban' .........................WHEN purban = 0 THEN 'rural' .........................ELSE 'mixed' END) as regtype ............FROM sqlbook..zipcensus zc) zc

......GROUP BY state, regtype

.....) zc

**Limits on the Model**

The interface between C# and SQL Server limits the size of the model to 8,000 bytes. This severely limits the size of the model. In future postings, I'll suggest an alternative implementation that gets around this limit.

The next posting discusses the C# implementation and the one after that extensions to the model.