Thursday, October 25, 2007

Marginal Value Models: Building and Using Data Mining Models in SQL Server (Part 1)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis With SQL and Excel. The first post is available here.]

A marginal value model is a very simple type of model. However, it gives a good example of how to implement a data mining model using SQL Server extensions. Recall that the model itself produces an expected value, in the same way as the chi-square calculation. This expected value is the model estimate.

The inputs to the model are the dimensions for the estimate, and these are necessarily categorical variables (strings) that take on, preferably, just a handful of values. As a note, the version described here has limits on the total number of values along all dimensions; these limits are imposed by SQL Server and discussed at the end of this post.

This post describes how to use the model. The next post describes how it is implemented. The third post describes some additional useful technical details.

This posting has three files attached:
As explained in the first posting, the first two files are a DLL containing the functionality and a T-SQL script that loads it into the database. The third file contains the code.

BasicMarginalValueModel Type
The marginal value model itself is a data type in SQL Server. The data type BasicMaringalValueModel is implemented as a C# class containing all the information that describes the model as well as various functions, such as:
  • ToString(), which converts the information describing the model to a string.
  • Parse(), which parses a string containing information describing the model.
  • Write(), which is like ToString() except the format is binary instead of character.
  • Read(), which is like Parse() except the format is binary instead of character.
Actually, there is a subtle difference between the ToString()/Parse() and Write()/Read() pairs. Write() and Read() are used implicitly when using a type with SQL Server. They do get called. On the other hand, ToString() and Parse() are only used by SQL Server when reading and writing BasicMarginalValueModel values to or from text. However, ToString() is very handy for seeing what is happening, so it is also used manually.

What does the model "look" like? It looks like pairs of values. So, if the model contained 50 states and three region types ("urban", "rural", and "mixed"), then an instance of the model would contain up to 53 key-value pairs. The key combines the dimension number (0 for state, 1 for region type) and dimension value (state or region type). The second has the value associated with it.

An Aggregation, Another Type, and a Function
Having a type is useful, but how do we create values of the type? The answer is simple, the CreateBasicMarginalValueModel aggregation function. This aggregation adds up the counts on all dimensions.

Ideally, the aggregation function could be called as:

SELECT ud.dbo.CreateMarginalValueModel(dim1, dim2, . . ., value)
FROM t

However, this is not possible, because aggregation functions can only take one argument. The data type MarginalValueModelArgs stores one or more dimensions along with a value. The value would typically be 1; however, it is also possible to create the models on summarized or partially summarized data.


This type has a creation function associated with it, MarginalValueArgs1(). This takes the first dimension and the value. To add more dimensions, the type defines a function AddDim(). The second, third, and so forth dimensions can be added using this function.


Defining functions in user defined types is highly recommended -- except for the issue of performance. Once you have a value of the type, the functions are accessible. And, they do not need to be defined in SQL Server. They come automatically with the type. Of course, accessing the function seems to require shuffling the type data back and forth from the DLL to SQL Server, reducing performance.


What Creating a Model Looks Like
The following code shows one way to create a model using state and region type as two dimensions:

SELECT ud.dbo.CreateBasicMarginalValueModel(arg).ToString()
FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).
.............AddDim(regtype) as arg
......FROM (SELECT zc.*,
...................(CASE WHEN purban = 1 THEN 'urban'
.........................WHEN purban = 0 THEN 'rural'
.........................ELSE 'mixed' END) as regtype
............FROM sqlbook..zipcensus zc) zc) zc

There are three layers of queries. The innermost query defines the region type. The next level defines the inputs into the model creation routine. Notice that the function MarginalValueArgs1() defines the first dimension on state and the AddDim() function defines the second.

Although this is useful for illustration, the model only exists long enough for us to see it using the ToString() function. When the query stops executing, the model is no longer accessible.

The following code assigns the model to a variable. The model can then be referenced in multiple select statements. Note that for this to work, the current database must be "ud", because that is where the data types are defined. Currently, it is not possible to define variables using data types defines in other databases.

DECLARE @model BasicMarginalValueModel
SET @model =
....(SELECT ud.dbo.CreateBasicMarginalValueModel(arg)
.... FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).
..................... AddDim(regtype) as arg
.......... FROM (SELECT zc.*,
....................... (CASE WHEN purban = 1 THEN 'urban'
............................. WHEN purban = 0 THEN 'rural'
............................. ELSE 'mixed' END) as regtype
................ FROM sqlbook..zipcensus zc) zc) zc

SELECT @model.ToString()

In this case, the @model variable is accessible for the statements, but it does not persist. However, because the model is just a variable with a complicated type it could also be stored in a table.

Scoring a Model
The process of scoring is simply applying the model to a given set of values. For instance, the following query scores all the rows in the zc table:

SELECT @model.Score(arg) FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).
.................AddDim(regtype) as arg
......FROM (SELECT zc.*,
...................(CASE WHEN purban = 1 THEN 'urban' .........................WHEN purban = 0 THEN 'rural' .........................ELSE 'mixed' END) as regtype ............FROM sqlbook..zipcensus zc) zc) zc

That is, the dimensions are bundled together into MarginalValueArgs and passed to the model for scoring.

The model can also be used to calculate the chi-squared value (which is probably the most useful thing to do with such a model). This is simply another function in the BasicMarginalValueModel.

SELECT @model.ChiSquared(arg)
FROM (SELECT state, regtype, count(*) as cnt, .............ud.dbo.MarginalValueArgs1(state, count(*)).
.......................................AddDim(regtype) as arg ......FROM (SELECT zc.*,
...................(CASE WHEN purban = 1 THEN 'urban' .........................WHEN purban = 0 THEN 'rural' .........................ELSE 'mixed' END) as regtype ............FROM sqlbook..zipcensus zc) zc
......GROUP BY state, regtype
.....) zc

Limits on the Model

The interface between C# and SQL Server limits the size of the model to 8,000 bytes. This severely limits the size of the model. In future postings, I'll suggest an alternative implementation that gets around this limit.

The next posting discusses the C# implementation and the one after that extensions to the model.


0 Comments:

Post a Comment

<< Home