Friday, November 2, 2007

Marginal Value Models: Overview of C# Code (Part 2)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis With SQL and Excel. The first post is available here.]

Marginal value models are a very simple type of model that calculate expected values along dimensions. The previous posting explains them in more detail.

This posting discusses C# coding issues in implementing the models. The next post discusses one particular aspect, which is the ability to return the marginal values created by the model.

For reference, the files associates with the model are available at:
The first two files contain the DLL and SQL code for loading functionality into SQL Server. The third file contains the source code for the functionality.

Overview of Model and Classes

The marginal value model does a very simple calculation. For each dimension, the marginal value model remembers the counts for all values along all dimensions. The goal is to calculate an expected value for a combination of dimensions, which involves the following steps:
  1. Divide the count for each value by the total count. This gets a p-value for each value.
  2. Multiply all the p-values together.
  3. Multiply the result by the total count.
The result is the expected value. The rest of this post discusses the implementation in C#, starting with the model itself, then the code to create it.

Defining BasicMarginalValueModel

The model is stored as a class. The following declaration defines a class for a model:

....MaxByteSize = 8000)]
public class BasicMarginalValueModel :
....INullable, IBinarySerialize, System.Collections.IEnumerable

This definition includes several compiler directives needed for the interface to SQL Server. The first, Serializable, means that the data in the model can be written to and read from, essentially, a file. In English, this implies that the methods Write() and Read() are defined.

The next directive specifies information about the type for the compiler. The maximum size of the type is 8,000 bytes. This is a SQL Server limit, alas. Also, remember that it applies to the Write() version of the model, not to the actual size in memory. The compiler option Format.UserDefined says that we are using useful types, so we need to write our own Write() and Read() routines. SQL Server can handle just a few types automatically; however, writing code using only unsigned integer values is a great limitation.

As a comment on this approach: it turns out that much of what we are doing -- putting values in and out of memory, defining NULL and so on -- is the type of work done by compilers. Fortunately, much of this work is rather mindless and esay. So after doing it once, it is easy to do it again for the next type.

The class itself inherits from three different classes; the first two are described in this entry. The second is decribed in the next one because it introduces a special type of functionality. The first in the list is the INullable class which enables the value to be NULL. In practice, this means that the following code fragment is in the class:

public bool isNull;

public bool IsNull
........return isNull;
} // IsNull

public static BasicMarginalValueModel Null
........BasicMarginalValueModel bmvm = BasicMarginalValueModel();
....return bmvm;
....} Text Color
} // Null

This code defines the NULL value for the class (this is something that has the type of the class and the value of NULL) and the IsNull property, required by the INullable class. There is little reason to vary this code. Personally, I think the INullable class could just implement it. I suppose the flexibility is there, though, so the boolean variable isNull does not have to be a member of the class.

The IBinarySerialize parent class requires the Read() and Write() functions.

Members of BasicMarginalValueModel

In order to be used, the model must contain the count for each value along each dimension. This table is, in fact, all that is needed for the model. The dimension values are assumed to be strings; the value being stored is the p-value, which is a double. In C#, the appropriate data structure is a dictionary, a built-in data structure which in common parlance is better known as a hash table. Perhaps the biggest strength of C# is the wealth of its built in container classes, so use them liberally.

The first step in using a dictionary is to tell C# where the definition is by including the following line at the top of the file:

using System.Collections.Generic;

The "using" clause is similar to an "#include" in the sense that both bring in outside definitions. However, "using" provides much more detail to the compiler, including compiler directives and definitions.

The dictionary class is generic. We have to tell it the types that it is storing. The following code describes the dictionary as we want to use it:

public System.Collections.Generic.Dictionary
.... marginals;

This syntax says to use the generic dictionary definition, where the key is a string (this is the thing being looked up) and the value is a double (this is the p-value), to define the class variable marginals.

The dictionary uses a trick to store all values along all dimensions. A potential problem is that a given value might be valid for different dimensions. Instead of the key simply being the value, it is a composite key consisting of the dimension number (starting from zero) followed by a colon and then the value. So, the value "upper" for the first dimension would be stored in the key "0:upper". One additional entry in the dictionar is also defined. The value "total:" represents the total count along all dimensions.

By the way, the creation of the key and the parsing of the dimension and value from the key should probably be separate private functions in the class. However, this code does not implement them this way.

The only additional members of the class are isNull and numdimensions.

Notes on Methods in BasicMarginalValueModel
In addition to the standard methods of the class, there are several additional methods. Most involve the functions needed to return the values in the model, which is discussed in the next post. However, two Score() and ChiSquared() are functions that are intended to be accessed from SQL. The advantage of putting these in the model class is that they can be directly accessed from SQL without having to define them using CREATE FUNCTION.

Both these functions call an internal function _Score() to do the calculation. Unfortunately, C# and SQL Server do not do a good job with function overloading, so this function is simply given a different name. That is, if Score() (or any other function) is overloaded, then it generates an error in SQL Server.

The Write() and Read() functions have obvious definitions with two small caveats. First, the number of items in the dictionary is written out. Then the number of dimensions, and then each dictionary entry. The number of items is needed so Read() knows when it is finished. The value is used in the loop.

In addition, there is the danger that "total:" will be defined twice during the Read(), once given the value of zero when an instance of the class is created and once when the dictionary entries are read. To prevent this, the value is removed from the dictionary. This step is not strictly necessary, because it happens not to be there. However, it is a good reminder.

Implementation of CreateBasicMarginalValueModel
Creating an instance of a marginal value model requires an aggregation. Such aggregations make use of the following compiler directives and parent classes:

....IsInvariantToNulls = true,
....IsInvariantToDuplicates = false,
....IsInvariantToOrder = false,
....MaxByteSize = 8000)]
public class CreateBasicMarginalValueModel : IBinarySerialize

The compiler directives specify that this is a serializable class with Write() and Read() methods. The second specifies various features of the aggregation. For instance, IsInvariantToNulls means that adding in a NULL value does not change the aggregation (think of the difference between COUNT(*) and COUNT()).

Members and Methods of CreateBasicMarginalValueModel

The class itself contains one member, an instance of BasicMarginalValueModel. This is updated using in the Accumulate() and Merge() members. Accumulate() updates a value for a dimension, either by adding it to the dictionary (if it does not exist) or incrementing the value stored in the dictionary.

Merge() actually does the same thing, just with two different dictionaries. Recall that Merge() is used to support parallelism. Two different processors might aggregate different chunks of data, which are then combined using this function.

Because the aggregation value needs to be passed between SQL Server and C#, the serialization routines need to be defined. However, these are trivial, because they call the corresponding routines for the one member (which are the routines defined for BasicMarginalValueModel).

About MarginalValueArgs
The argument to CreateBasicMarginalValueModel requires both a value and an associated count (because aggregation functions only take one argument, the value and count need to be combined into a single type). This definition is very similar to WeghtedValue described in an earlier posting.

There is a creation function associated with MarginalValueArgs. This is standard whenever adding a type. An associated function is needed to create an instance of the type.

The next posting describes one additional feature of the basic marginal value model. This feature is the ability to list all the values in the model, and it introduces the idea of a table-values function. Such a function is yet another useful extension of SQL Server.


Post a Comment

<< Home