Friday, November 30, 2007

Naive Bayesian Models (Part 1)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis With SQL and Excel. The first post is available here.]


The previous posts have shown how to extend SQL Server to support some basic modeling capabilities. This post and the next post add a new type of model, the naive Bayesian model, which is actually quite similar to the marginal value model discussed earlier.

This post explains some of the mathematics behind the model. A more thorough discussion is available in my book Data Analysis Using SQL and Excel.

What Does A Naive Bayesian Model Do?
A naive Bayesian model calculates a probability by combining summary information along different dimensions.

This is perhaps best illustrated by an example. Say that we have a business where 55% of customers survive for the first year. Say that male customers have a 60% probability of remaining a customer after one year and that California customers have an 80% probability. What is the probability that a male customer from California will survive the first year?

The first thing to note is that the question has no correct answer; perhaps men in California are quite different from men elsewhere. The answer could be any number between 0% and 100%.

The second thing to note is the structure of the problem. We are looking for a probability for the intersection of two dimensions (gender and state). To solve this, we have:
  • The overall probability for the population (55%).
  • The probability along each dimension (60% and 80%).
The naive Bayesian model combines this information by making an assumption (which may or may not be true). In this case, the answer is that a male from California has an 83.1% probability of surviving the first year.

The naive Bayesian model can handle any number of dimensions. However, it is always calculating a probability using information about the probabilities along each dimension individually.

Probabilities and Likelihoods
The value of 83.1% may seem surprising. Many people's intuition would put the number between 60% and 80%. Another way of looking at the problem, though, might make this clearer. Being male makes a customer more likely to stay for a year. Being from California makes a customer even more likely to stay. Combining the information along the two dimensions should produce a stronger effect than either dimension alone.

It is one thing to explain this in words. Modeling and data mining requires explaining things with formulas. The problem is about probabilities, but the solution uses a related concept.

The likelihood has a simple formula: likelihood = p / (1-p), where p is the probability. That is, it is the ratio of the probability of something happening to its not happening. Where the probability varies from 0% to 100%, the likelihood varies from zero to infinity. Also, given a likelihood, the probability is easily calculated: p = 1 - (1/(1+likelihood)).

The likelihood is also known as the odds. When we say something has 1 in 9 odds, we mean that it happens one time for every nine times it does not happen. Another way of saying this is that the probability is 10%.
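To make the relationship concrete, here is a minimal C# sketch of the two conversions (these helper names are mine, purely for illustration; they are not part of the code in these posts):

public static double Likelihood(double p)
{
    // likelihood (odds) = p / (1 - p); for example, 0.10 becomes 1/9
    return p / (1.0 - p);
}

public static double Probability(double likelihood)
{
    // probability = 1 - 1 / (1 + likelihood); for example, 1/9 becomes 0.10
    return 1.0 - (1.0 / (1.0 + likelihood));
}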

For instance, the following are the likelihoods for the simple problem being discussed:
  • overall likelihood (p = 55%) = 1.22;
  • male likelihood (p = 60%) = 1.50; and,
  • California likelihood (p = 80%) = 4.00.
Notice that the likelihoods vary more dramatically than the probabilities. That is, 80% is just a bit more than 60%, but 4.0 is much larger than 1.5.

The Naive Bayesian Formula
The formula for the naive Bayesian model uses one more concept, the likelihood ratio. This is the ratio of any given likelihood to the overall likelihood. This ratio also varies from zero to infinity. When the likelihood ratio is greater than one, then something is more likely to occur than on average for everyone (such as the case with both males and Californians).

The formula for the naive Bayesian model says the following: the overall likelihood of something occurring along multiple dimensions is the overall likelihood times the likelihood ratios along each dimension.

For the example, the formula produces: 1.22*(1.5/1.22)*(4.0/1.22)=4.91. When converted back to a probability this produces 83.1%.
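As a check on the arithmetic, the following C# fragment carries the example through from probabilities to likelihoods and back (the variable names are mine; the SQL Server implementation appears in the next post):

double overallP    = 0.55;                                // overall one-year survival
double maleP       = 0.60;                                // males
double californiaP = 0.80;                                // Californians

double overallL    = overallP / (1.0 - overallP);         // about 1.22
double maleL       = maleP / (1.0 - maleP);               // 1.50
double californiaL = californiaP / (1.0 - californiaP);   // 4.00

// overall likelihood times the likelihood ratio along each dimension;
// a dimension with no information simply contributes no ratio here
double combinedL = overallL * (maleL / overallL) * (californiaL / overallL);   // about 4.91

double combinedP = 1.0 - (1.0 / (1.0 + combinedL));       // about 0.831, or 83.1%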

What Does the Naive Assumption Really Mean?
The "Bayesian" in "naive Bayesian" refers to a basic probability formula devised by Rev. Thomas Bayes in the early 1700s. This probability formula is used to devise the formula described above.

The "naive" in naive Bayesian refers to a simple assumption. This is the assumption that the information along the two dimensions is independent. This is the same assumption that we made for the marginal value model. In fact, the two models are very similar. Both combine information along dimensions into a single value. In the first case, it is counts. In the second case, it is probabilities.

In the real world, it is unusual to find dimensions that are truly independent. However, the naive Bayesian approach can still work well in practice. Often, we do not need the actual probabilities. It is sufficient to have relative measures (males from California are better risks than females from Nevada, for instance).

If we further analyzed the data or did a test and learned that males from California really survived at only a 40% rate instead of 83.1%, then this fact would be evidence that state and gender are not independent. The solution is simply to replace state and gender by a single category that combines the two: California-male, California-female, Nevada-male, and so on.

One of the nice features of these models is that they can use a large number of features of the data and readily handle missing information (the likelihood value for a dimension that is missing is simply not included in the equation). This makes them feasible for applications such as classifying text, where many other techniques do not work so well. It also makes it possible to calculate a probability for a combination of dimensions that has never been seen before -- made possible by the naive assumption.

The next posting contains the code for a basic naive Bayesian model in SQL Server.

Saturday, November 24, 2007

Managing SQL Server Extensions (Functions, Types, Etc.)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis With SQL and Excel. The first post is available here.]

Up to now, I have discussed adding functions, types, and aggregates into SQL Server. The code is written in C# and loaded into SQL Server as an assembly. Placing the code into SQL Server has four steps:
  1. Deleting all objects already defined in the assembly (if any).
  2. Deleting the assembly (if present).
  3. Loading a new version of the assembly.
  4. Redefining the objects in the assembly.
For readers who are familiar with the process of compiling and linking code, this process is similar to linking. The references in the assembly have to be "linked" into the database, so the database knows what the references refer to.

I am doing this process manually for two reasons. First, because this is how I originally set up this project for adding data mining functionality into SQL Server (even though Visual Studio does have options for doing this automatically). Second, this approach provides an opportunity to start to understand how SQL Server manages user defined types.

This post discusses how to manage the first of these steps automatically. That is, it describes how to delete all objects in a database referenced by a particular assembly.


A Common Error
The following code drops the user defined aggregate CreateBasicMarginalValueModel():

DROP AGGREGATE CreateBasicMarginalValueModel
GO

This statement is quite simple. However, if it is executed twice, it returns the error:

Msg 3701, Level 11, State 5, Line 1
Cannot drop the aggregate function 'createbasicmarginalvaluemodel', because it does not exist or you do not have permission.

This is inelegant, because it means that we cannot run the same code to drop a function twice in a row. Even if it works the first time, the same code will return an error the second time it runs. Furthermore, when we see this error, we do not know if the problem is the non-existence of the object or inadequate database permissions.

To fix this, we use the T-SQL IF construct:

IF OBJECT_ID('CreateBasicMarginalValueModel') IS NOT NULL
....DROP AGGREGATE CreateBasicMarginalValueModel
GO

This simply says that if the object exists, then drop the aggregate. However, this is inelegant, because it mentions the name of the aggregate function twice, once in the "if" clause and once when dropping it. In addition, we do not want to have to explicitly mention every object by name, since we may not know which objects in the assembly were actually declared in the database.

Handling Dependencies
Another problem occurs when we try to drop a type. The following statement:

IF (SELECT COUNT(*) FROM sys.types
....WHERE UPPER(name) = 'BASICMARGINALVALUEMODEL') > 0

....DROP TYPE BasicMarginalValueModel
GO


returns the enigmatic error:

Msg 3732, Level 16, State 1, Line 3
Cannot drop type 'BasicMarginalValueModel' because it is currently in use.

This error does not, unfortunately, tell us who or what is using the type. In this case, it is the set of functions, procedures, aggregates, and other types that use the type as an argument or return value. We have to remove all these objects before we can remove the type.

In general, we need to remove functions, aggregates, and procedures before we remove types. This ensures that the types have no dependencies on them, so they can be removed cleanly from the database.

This problem with dependencies is actually an advantage. It ensures that code loaded into the database all refers to the proper set of definitions. If a function uses a type, we cannot simply replace the type. We need to drop the function, drop the type, and then re-declare the type and function. This ensures that the function refers to the proper code when using the type.

Finding All User-Defined Functions in an Assembly

The first step to removing a certain class of objects, say functions, is to find all of them in the database. They are conveniently located in the sys.objects table, so the following query returns all user-defined functions:

SELECT o.name
FROM sys.objects o
WHERE o.type in ('FS', 'FT')

The sys.objects table contains references to many different objects in the database (almost everything except user defined types). The specific type abbreviations 'FS' and 'FT' refer to scalar functions and table functions, respectively.

The only problem with this code fragment is that it returns all user defined functions; there might be user defined functions from other assemblies as well. What we really want are only the user defined functions in the "ud" assembly. To find these, we have to use two more reference tables. To get all the functions in "ud", the query looks like:

SELECT o.name
FROM sys.objects o JOIN
.....sys.assembly_modules am
.....ON o.object_id = am.object_id JOIN
.....sys.assemblies a
.....ON am.assembly_id = a.assembly_id
WHERE a.name = 'ud' and
......o.type in ('FS', 'FT')


This finds all user defined functions only in the desired assembly. The code for procedures and aggregates is quite similar. The only difference is that the type in the WHERE clause matches 'PC' and 'AF', respectively.

User defined types are somewhat different. They are stored in the table sys.types, rather than sys.objects. The query to find all of them is similar, requiring looking up assembly information in additional tables:

SELECT t.name
FROM sys.types t JOIN
.....sys.type_assembly_usages tau
.....ON t.user_type_id = tau.user_type_id JOIN
.....sys.assemblies a
.....ON tau.assembly_id = a.assembly_id
WHERE a.name = 'ud'


Although the query is somewhat different, it returns the names of the user defined types in the given assembly.


Deleting All User Defined Functions
Going from a query that returns a list of user-defined functions (or whatever) to actions on those functions (such as dropping them) requires using the T-SQL command language. In particular, we need to define cursors on the query, so we can do something to each row.

Code that uses cursors has the following structure:

DECLARE @name VARCHAR(2000)
DECLARE the_cursor CURSOR FOR
....QUERY

OPEN the_cursor
....FETCH next FROM the_cursor INTO @name
....WHILE @@fetch_status = 0
....BEGIN
........DO ACTION HERE
........FETCH NEXT FROM the_cursor INTO @name
....END

CLOSE the_cursor
DEALLOCATE the_cursor


The first two lines of the code declare two variables. The first is a standard scalar variable, which is used to store each value returned by the query. The second is a cursor, which is used to cycle through the rows. Notice that the cursor variable is not preceded by an at sign.

Most of the remaining code is the framework used to manage the cursor. It is important to handle cursors correctly. A simple mistake -- such as leaving out the FETCH NEXT FROM -- can result in an infinite loop. We do not want that to happen.

Opening the cursor runs the query and the FETCH NEXT statement gets the next value, which is placed in the local variable @name. When there are no more values, the cursor is closed and deallocated.

The full code for dropping all functions is a bit longer, because the query and action portions are filled in:

DECLARE @function_name VARCHAR(2000)
DECLARE function_cursor CURSOR FOR
....SELECT o.name
....FROM sys.objects o JOIN
.........sys.assembly_modules am
.........ON o.object_id = am.object_id JOIN
.........sys.assemblies a
.........ON am.assembly_id = a.assembly_id
....WHERE a.name = 'ud' AND
..........o.type in ('FS', 'FT')

OPEN function_cursor
....FETCH next FROM function_cursor INTO @function_name
....WHILE @@fetch_status = 0
....BEGIN
........EXEC('DROP FUNCTION '+@function_name)
........FETCH NEXT FROM function_cursor INTO @function_name
....END

CLOSE function_cursor
DEALLOCATE function_cursor
PRINT 'DROPPED FUNCTIONS'


The cursor is defined over the query that returns all the functions in the "ud" assembly. For each of these functions, the action is to drop the function. The action uses the EXEC() function rather than just the DROP FUNCTION statement. The EXEC() function takes a string as an argument and executes the string as a T-SQL statement. This makes it possible to incorporate the name of the function into the command.

The Full Code
The code for dropping aggregates, procedures, and types follows the same structure as the code for dropping functions. The only differences are to the query that defines the cursor and the string passed to the EXEC() function (DROP AGGREGATE, DROP PROCEDURE, or DROP TYPE).

The only important aspect to the code is that types need to be dropped last, because of the dependency problem.

This entry does not include the T-SQL code for this example. The next entry discusses naive Bayesian models. The entry after that will include code that has these enhancements.




Saturday, November 10, 2007

Marginal Value Models: C# Table Valued Functions (Part 3)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis With SQL and Excel. The first post is available here.]

The previous two posts introduce marginal value models. Underlying these models is a table of values. This post discusses how this table can be returned in SQL Server. In other words, this post discusses table valued functions.

For reference, the files associated with the model are available at:
The first two files contain the DLL and SQL code for loading functionality into SQL Server. The third file contains the source code for the functionality. These files are slightly different from the previous blog3 files, since I fixed some errors in them.

What Are Table Valued Functions?
In earlier posts, I introduced user defined functions in SQL. These functions have all been scalar functions, whether implemented as stand-alone user defined functions or as methods in a user defined type. SQL Server also supports user defined table valued functions. The purpose here is to return all the values in a BasicMarginalValueModel.

The following T-SQL code shows an example for this:

DECLARE @mod dbo.BasicMarginalValueModel

SELECT @mod = ud.dbo.CreateBasicMarginalValueModel(arg)
FROM (SELECT TOP 100
.............ud.dbo.MarginalValueArgs2(zc.population,
...............zc.hhmedincome, 1) as arg
......FROM sqlbook..zipcensus zc) zc

SELECT m.mvme.ToString()
FROM ud.dbo.MarginalValues(@mod) m

The first statement declares a variable called @mod as a BasicMarginalValueModel. The second assigns this variable a value, using the first 100 rows of the table zipcensus (provided on the companion page to the book Data Analysis Using SQL and Excel).

The third statement calls the table valued function MarginalValues(). This function returns the values stored in each cell of the model. So, if there are two dimensions and each has ten values, then this returns twenty rows. Of course, because each returned cell contains more than one value (a string and a number), a new data type is needed to store these values. This data type is called MarginalValueModelElement. The attached files contain the definitions for these functions and types.

A second table valued function is also defined for the type. This function is called AllCells() and it returns all combinations of the cells. So, if there are ten values along two dimensions, this function returns one hundred rows, one for each combination of the two values. This function also shows that it is possible to have more than one table valued function within a given model.

Defining Table Valued Functions in T-SQL
Table valued functions have to be declared in T-SQL. The definition is an extension of the definition of scalar valued functions.

The MarginalValues() function returns a specific type, so this needs to be declared. This is simply:

CREATE TYPE MarginalValueModelElement
EXTERNAL NAME ud.MarginalValueModelElement

GO

The function itself uses the code:

CREATE FUNCTION MarginalValues(@arg BasicMarginalValueModel)
RETURNS TABLE (mvme MarginalValueModelElement)
AS EXTERNAL NAME ud.BasicMarginalValueModel.InitMarginalValueEnumerator
GO

First, notice that table valued functions use the same keyword as scalar functions. The difference is the use of RETURNS TABLE rather than just RETURNS. After this keyword comes the table definition. Table valued functions can only return tables with one column. I am not sure if this is a limitation of SQL Server or a C# limitation (table valued functions are implemented as enumerators in C#).

Second, notice that the table valued function is actually defined within the type BasicMarginalValueModel. Scalar functions defined within a type do not need explicit declarations; however, table functions do. Although the function is defined within the type, it is defined as static, so it still needs to take the model as an argument. In fact, all user defined functions declared explicitly in SQL Server must be static, both scalar and table functions.

Notice that the function definition defines the name of the column as mvme. In the previous code, this column name is used to access values.

Within SQL Server, scalar functions and table valued functions are stored separately. After loading blog3enum.dll using blog3enum-load.sql (two files mentioned at the top of this post), the following are in SQL Server:


[Screenshot: the Object Explorer tree showing the loaded scalar functions and table-valued functions.]
Notice that SQL Server has separate areas for scalar functions and table-valued functions. I find this ironic, since the metadata stores them in the same way.

The Primitives for Implementing Them in C#
The C# code for table valued functions is basically the code for user defined enumerators. A user defined enumerator is something that you use for the foreach statement.

There are three steps for creating a user-defined enumerator in C#:
  1. Declare the class to be an instance of System.Collections.IEnumerable.
  2. Declare the two enumeration functions.
  3. Declare the enumeration class that does all the work.
The next three sections discuss these in a bit more detail.

IEnumerable and IEnumerator
Declaring a table valued function requires declaring a user defined enumeration, and this in turn requires using two underlying classes. The distinction between these two classes is a bit subtle and confusing, although the ideas are not really difficult.

The first class is the IEnumerable class. This class says "hey, I'm a class that supports foreach". We need it, because such classes are actually what table-valued functions are. And this makes sense. A table valued function has a bunch of rows that are returned one-by-one. The foreach clause does the same thing in C#.

The second class is IEnumerator, which we will see used below. This class is not a declaration of an external interface. Instead, it is used in the bowels of the foreach. It maintains the state needed to fetch the next value.

I would like to add one more comment about table valued functions. Unlike aggregation functions, they do not seem to support a parallel interface. This is unfortunate, since this limits the scalability of code using them.

Declaring SQL Table Functions
Two functions are needed to define a table valued function. The first is the enumeration function and the second is a helper function that "fills" a row. These two functions are defined as follows:

[SqlFunction(FillRowMethodName = "BVMMEnumeratorFillRow")]
public static BVMMElementEnumerator
....InitMarginalValueEnumerator (BVMM csm)

{
....return new BVMMElementEnumerator(csm);
}

public static void
....BVMMEnumeratorFillRow (Object row,
........out MarginalValueModelElement evme)

{
....evme = (MarginalValueModelElement)row;
} // BasicMarginalValueModelEnumeratorFillRow()


(In this code, I have used BVMM for BasicMarginalValueModel so the code formats more easily.)

The first of these functions is the reference used in the CREATE FUNCTION statement. It uses a compiler directive specific to the SQL Server interface. This directive simply says that the function to call to fill in each row is called BVMMEnumeratorFillRow. Not surprisingly, this is the other function.

The first function returns the enumerator. This is a special class that stores state between calls to the enumerator. This is discussed in the next section.

The underlying C# routines that do the enumerations use very general code that works in terms of objects and that has nothing to do with SQL Server. The interface to SQL Server uses the fill-row routine, which simply copies the appropriate values into the row, and this is handled by casting the object to the appropriate type.

Defining the Enumeration Class
The enumeration class is the most complex part of the definition. However, in this case, the code is rather simple, because it accesses an underlying enumerator used for the Dictionary class.

First, a word about the class used for the MarginalValues() SQL function. It is called BasicMarginalValueModelElementEnumerator. The connection between the SQL function and this class is not readily apparent. It requires looking at the C# function used to define MarginalValues(). This function is called InitMarginalValueEnumerator() and it creates an instance of this enumeration class.

So, the class must be defined to inherit from System.Collections.IEnumerator; this sets it up to have the appropriate interface for an enumeration.

This class contains the following elements:
  • A private member to store the state. This is an instance of the class System.Collections.IEnumerator.
  • A constructor, which assigns the enumerator from the dictionary to the private member.
  • A MoveNext() function that goes to the next element in the list. This simply calls the dictionary function.
  • A Reset() function that starts over at the beginning.
  • A Current member that returns the current value of the enumerator as an object. It is this object that is then copied into the row, using the fill function.
All of these are defined in terms of the enumeration for the Dictionary class, so the code itself is quite simple. Note that everything in this class is set up only for the enumeration and not for SQL code. The class has no SQL-specific compiler directive, or functions like Write() and Read(). It is the fill-row function that takes the value returned by the enumerator and transfers the value into the SQL Server world.
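As a rough sketch of what such a class looks like (simplified from the posted source, and with an assumed constructor for MarginalValueModelElement), the enumerator can be little more than a wrapper around the dictionary's own enumerator:

using System.Collections;
using System.Collections.Generic;

// Sketch only: the element constructor and exact member names are
// assumptions; the posted source code is the real reference.
public class BasicMarginalValueModelElementEnumerator : IEnumerator
{
    // private member that stores the state: the dictionary's enumerator
    private IEnumerator dictEnumerator;

    public BasicMarginalValueModelElementEnumerator(
        Dictionary<string, double> marginals)
    {
        dictEnumerator = marginals.GetEnumerator();
    }

    // advance to the next dictionary entry
    public bool MoveNext()
    {
        return dictEnumerator.MoveNext();
    }

    // start over at the beginning
    public void Reset()
    {
        dictEnumerator.Reset();
    }

    // return the current value as an object; the fill-row function
    // casts this object back to the element type
    public object Current
    {
        get
        {
            KeyValuePair<string, double> kvp =
                (KeyValuePair<string, double>)dictEnumerator.Current;
            return new MarginalValueModelElement(kvp.Key, kvp.Value);
        }
    }
}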

The AllCells enumeration function provides a more complicated example. In this case, the calculations are done explicitly, because there is no underlying type to support the functionality.
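Just to illustrate the idea behind AllCells() -- enumerating every combination of values across the dimensions -- here is a sketch of the cross-product logic using an odometer-style counter. The real function works directly against the model's dictionary and returns model elements, so this is only the skeleton of the calculation, with assumed names and types:

using System.Collections.Generic;

// Sketch of the cross-product idea only; names and types are assumptions.
public static IEnumerable<string[]> AllCombinations(List<List<string>> dims)
{
    int[] index = new int[dims.Count];               // odometer over the dimensions
    while (true)
    {
        // build the current cell from the current index along each dimension
        string[] cell = new string[dims.Count];
        for (int d = 0; d < dims.Count; d++)
            cell[d] = dims[d][index[d]];
        yield return cell;

        // advance the odometer, carrying into earlier dimensions as needed
        int dim = dims.Count - 1;
        while (dim >= 0 && ++index[dim] == dims[dim].Count)
        {
            index[dim] = 0;
            dim--;
        }
        if (dim < 0)
            yield break;                             // wrapped around: all cells visited
    }
}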

Table Valued Functions and Modeling
Table valued functions are a very powerful feature of SQL Server. However, they are ancillary to my goal, which is to understand how to extend the SQL language to support data mining concepts such as modeling.

They do have one very large shortcoming, which is the fact that their interface does not support parallel scalability. This is significant, because my choice of SQL is partly due to its scalability. Remember that the user defined aggregation functions include a Merge() method which does support parallelism. There is no corresponding capability for table valued functions.

The preceding three posts have been a detailed exposition on how to incorporate one type of model into SQL Server. The first explained the model; the second explained the C# code; and this, the third, explains table valued functions.

Much of this has been preparatory. The basic marginal value model is more useful as an example than as a modeling tool. The next post is about making the T-SQL load script a bit simpler. It will then be followed by the description of another type of model. Naive Bayesian models are quite powerful and useful, and actually quite similar to marginal value models.

Friday, November 2, 2007

Marginal Value Models: Overview of C# Code (Part 2)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis With SQL and Excel. The first post is available here.]

Marginal value models are a very simple type of model that calculate expected values along dimensions. The previous posting explains them in more detail.

This posting discusses C# coding issues in implementing the models. The next post discusses one particular aspect, which is the ability to return the marginal values created by the model.

For reference, the files associated with the model are available at:
The first two files contain the DLL and SQL code for loading functionality into SQL Server. The third file contains the source code for the functionality.

Overview of Model and Classes

The marginal value model does a very simple calculation. It remembers the counts for all values along each dimension. The goal is to calculate an expected value for a combination of dimension values, which involves the following steps:
  1. Divide the count for each value by the total count. This gets a p-value for each value.
  2. Multiply all the p-values together.
  3. Multiply the result by the total count.
The result is the expected value. The rest of this post discusses the implementation in C#, starting with the model itself, then the code to create it.
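As a small numeric sketch of those three steps (the names here are mine, not the class's):

// two dimensions; counts for the chosen value along each one
double totalCount = 1000.0;
double[] dimensionCounts = { 200.0, 300.0 };

// steps 1 and 2: divide each count by the total and multiply the p-values;
// step 3: multiply the product by the total count
double expected = totalCount;
foreach (double count in dimensionCounts)
{
    expected *= count / totalCount;
}
// expected = 1000 * (200/1000) * (300/1000) = 60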

Defining BasicMarginalValueModel

The model is stored as a class. The following declaration defines a class for a model:

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.UserDefined,
....MaxByteSize = 8000)]
public class BasicMarginalValueModel :
....INullable, IBinarySerialize, System.Collections.IEnumerable

This definition includes several compiler directives needed for the interface to SQL Server. The first, Serializable, means that the data in the model can be written to and read from, essentially, a file. In English, this implies that the methods Write() and Read() are defined.

The next directive specifies information about the type for the compiler. The maximum size of the type is 8,000 bytes. This is a SQL Server limit, alas. Also, remember that it applies to the Write() version of the model, not to the actual size in memory. The compiler option Format.UserDefined says that we are using useful types, so we need to write our own Write() and Read() routines. SQL Server can handle just a few types automatically; however, writing code using only unsigned integer values is a great limitation.

As a comment on this approach: it turns out that much of what we are doing -- putting values in and out of memory, defining NULL, and so on -- is the type of work done by compilers. Fortunately, much of this work is rather mindless and easy. So after doing it once, it is easy to do it again for the next type.

The class itself inherits from three different classes; the first two are described in this entry. The third is described in the next one because it introduces a special type of functionality. The first in the list is the INullable class which enables the value to be NULL. In practice, this means that the following code fragment is in the class:

public bool isNull;

public bool IsNull
{
....get
....{
........return isNull;
....}
} // IsNull

public static BasicMarginalValueModel Null
{
....get
....{
........BasicMarginalValueModel bmvm =
............new BasicMarginalValueModel();
........return bmvm;
....}
} // Null

This code defines the NULL value for the class (this is something that has the type of the class and the value of NULL) and the IsNull property, required by the INullable class. There is little reason to vary this code. Personally, I think the INullable class could just implement it. I suppose the flexibility is there, though, so the boolean variable isNull does not have to be a member of the class.

The IBinarySerialize parent class requires the Read() and Write() functions.

Members of BasicMarginalValueModel

In order to be used, the model must contain the count for each value along each dimension. This table is, in fact, all that is needed for the model. The dimension values are assumed to be strings; the value being stored is the p-value, which is a double. In C#, the appropriate data structure is a dictionary, a built-in data structure which in common parlance is better known as a hash table. Perhaps the biggest strength of C# is the wealth of its built in container classes, so use them liberally.

The first step in using a dictionary is to tell C# where the definition is by including the following line at the top of the file:

using System.Collections.Generic;

The "using" clause is similar to an "#include" in the sense that both bring in outside definitions. However, "using" provides much more detail to the compiler, including compiler directives and definitions.

The dictionary class is generic. We have to tell it the types that it is storing. The following code describes the dictionary as we want to use it:

public System.Collections.Generic.Dictionary<string, double>
....marginals;

This syntax says to use the generic dictionary definition, where the key is a string (this is the thing being looked up) and the value is a double (this is the p-value), to define the class variable marginals.

The dictionary uses a trick to store all values along all dimensions. A potential problem is that a given value might be valid for different dimensions. Instead of the key simply being the value, it is a composite key consisting of the dimension number (starting from zero) followed by a colon and then the value. So, the value "upper" for the first dimension would be stored under the key "0:upper". One additional entry in the dictionary is also defined. The value "total:" represents the total count along all dimensions.

By the way, the creation of the key and the parsing of the dimension and value from the key should probably be separate private functions in the class. However, this code does not implement them this way.
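For what it is worth, such helpers might look like the following sketch (these are hypothetical functions, not part of the posted code):

// hypothetical private helpers for the composite key scheme described above
private static string MakeKey(int dimension, string value)
{
    // MakeKey(0, "upper") returns "0:upper"
    return dimension.ToString() + ":" + value;
}

private static void ParseKey(string key, out string dimension, out string value)
{
    int pos = key.IndexOf(':');
    dimension = key.Substring(0, pos);    // "0", "1", ... (or "total")
    value = key.Substring(pos + 1);       // the dimension value (empty for "total:")
}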

The only additional members of the class are isNull and numdimensions.

Notes on Methods in BasicMarginalValueModel
In addition to the standard methods of the class, there are several additional methods. Most involve the functions needed to return the values in the model, which are discussed in the next post. However, two of them, Score() and ChiSquared(), are functions that are intended to be accessed from SQL. The advantage of putting these in the model class is that they can be directly accessed from SQL without having to define them using CREATE FUNCTION.

Both these functions call an internal function _Score() to do the calculation. Unfortunately, C# and SQL Server do not do a good job with function overloading, so this function is simply given a different name. That is, if Score() (or any other function) is overloaded, then it generates an error in SQL Server.

The Write() and Read() functions have obvious definitions, with two small caveats. First, the number of items in the dictionary is written out, then the number of dimensions, and then each dictionary entry. The number of items is needed so Read() knows when it is finished; it controls the read loop.

In addition, there is the danger that "total:" will be defined twice during the Read(), once given the value of zero when an instance of the class is created and once when the dictionary entries are read. To prevent this, the entry is removed from the dictionary before the entries are read in. This step is not strictly necessary, because the entry happens not to be there at that point. However, it is a good safeguard.
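Following that description, the two routines look roughly like the sketch below (the exact field order is an assumption on my part; the posted source is the reference):

public void Write(System.IO.BinaryWriter w)
{
    w.Write(marginals.Count);          // number of dictionary entries
    w.Write(numdimensions);            // number of dimensions
    foreach (KeyValuePair<string, double> kvp in marginals)
    {
        w.Write(kvp.Key);
        w.Write(kvp.Value);
    }
}

public void Read(System.IO.BinaryReader r)
{
    int numitems = r.ReadInt32();
    numdimensions = r.ReadInt32();
    // the constructor may already have created "total:" with a value of zero;
    // remove it so the entries read below do not collide with it
    marginals.Remove("total:");
    for (int i = 0; i < numitems; i++)
    {
        string key = r.ReadString();
        double val = r.ReadDouble();
        marginals[key] = val;
    }
}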


Implementation of CreateBasicMarginalValueModel
Creating an instance of a marginal value model requires an aggregation. Such aggregations make use of the following compiler directives and parent classes:

[Serializable]
[SqlUserDefinedAggregate(Format.UserDefined,
....IsInvariantToNulls = true,
....IsInvariantToDuplicates = false,
....IsInvariantToOrder = false,
....MaxByteSize = 8000)]
public class CreateBasicMarginalValueModel : IBinarySerialize

The compiler directives specify that this is a serializable class with Write() and Read() methods. The second specifies various features of the aggregation. For instance, IsInvariantToNulls means that adding in a NULL value does not change the aggregation (think of the difference between COUNT(*) and COUNT(column)).

Members and Methods of CreateBasicMarginalValueModel

The class itself contains one member, an instance of BasicMarginalValueModel. This is updated in the Accumulate() and Merge() methods. Accumulate() updates a value for a dimension, either by adding it to the dictionary (if it does not exist) or incrementing the value stored in the dictionary.

Merge() actually does the same thing, just with two different dictionaries. Recall that Merge() is used to support parallelism. Two different processors might aggregate different chunks of data, which are then combined using this function.
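A sketch of the two methods follows. The member names on MarginalValueArgs (Values and Count) and the name of the model member (bmvm) are assumptions on my part; the logic is what the description above calls for, assuming the model's constructor has already created the "total:" entry with a value of zero:

public void Accumulate(MarginalValueArgs arg)
{
    // add or increment the count for the value along each dimension
    for (int dim = 0; dim < arg.Values.Length; dim++)
    {
        string key = dim.ToString() + ":" + arg.Values[dim];
        if (bmvm.marginals.ContainsKey(key))
            bmvm.marginals[key] += arg.Count;
        else
            bmvm.marginals[key] = arg.Count;
    }
    bmvm.marginals["total:"] += arg.Count;    // keep the running total
}

public void Merge(CreateBasicMarginalValueModel other)
{
    // combine two partial aggregations (this is what supports parallelism)
    foreach (KeyValuePair<string, double> kvp in other.bmvm.marginals)
    {
        if (bmvm.marginals.ContainsKey(kvp.Key))
            bmvm.marginals[kvp.Key] += kvp.Value;
        else
            bmvm.marginals[kvp.Key] = kvp.Value;
    }
}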

Because the aggregation value needs to be passed between SQL Server and C#, the serialization routines need to be defined. However, these are trivial, because they call the corresponding routines for the one member (which are the routines defined for BasicMarginalValueModel).

About MarginalValueArgs
The argument to CreateBasicMarginalValueModel requires both a value and an associated count (because aggregation functions only take one argument, the value and count need to be combined into a single type). This definition is very similar to WeightedValue described in an earlier posting.

There is a creation function associated with MarginalValueArgs. This is standard whenever adding a type. An associated function is needed to create an instance of the type.
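As a hedged sketch of what such a creation function might look like (the parameter list, the Values and Count members, and the exact signature are my assumptions; the posted code's MarginalValueArgs2() used in the T-SQL example elsewhere in this series is the actual version):

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

// Hypothetical sketch of a creation function for the argument type; the
// member names (Values, Count) and the parameters are assumptions.
[SqlFunction]
public static MarginalValueArgs MarginalValueArgs2(
    SqlDouble value1, SqlDouble value2, SqlDouble count)
{
    // package the two dimension values and the count into one argument
    MarginalValueArgs arg = new MarginalValueArgs();
    arg.Values = new string[] { value1.Value.ToString(), value2.Value.ToString() };
    arg.Count = count.Value;
    return arg;
}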

The next posting describes one additional feature of the basic marginal value model. This feature is the ability to list all the values in the model, and it introduces the idea of a table-valued function. Such a function is yet another useful extension of SQL Server.