Saturday, November 10, 2007

Marginal Value Models: C# Table Valued Functions (Part 3)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis With SQL and Excel. The first post is available here.]

The previous two posts introduce marginal value models. Underlying these models is a table of values. This post discusses how this table can be returned in SQL Server. In other words, this post discusses table valued functions.

For reference, the files associates with the model are available at:
The first two files contain the DLL and SQL code for loading functionality into SQL Server. The third file contains the source code for the functionality. These files are slightly different from the previous blog3 files, since I fixed some errors in them.

What Are Table Valued Functions?
In earlier posts, I introduce user defined functions in SQL. These functions have all been scalar functions, whether implemented as user defined functions or as methods in a user defined type. SQL Server also supports user defined table valued functions. The purpose here is to return all the values in a BasicMarginalValueModel.

The following T-SQL code shows an example for this:

DECLARE @mod dbo.BasicMarginalValueModel

SELECT @mod = ud.dbo.CreateBasicMarginalValueModel(arg)
FROM (SELECT TOP 100
.............ud.dbo.MarginalValueArgs2(zc.population,
............... .zc.hhmedincome, 1) as arg
......FROM sqlbook..zipcensus zc) zc

SELECT m.mvme.ToString()
FROM ud.dbo.MarginalValues(@mod) m

The first statement declares a variable called @mod as a BasicMarginalValueModel. The second assigns this variable a value, using the first 100 rows of the table zipcensus (provided on the companion page to the book Data Analysis Using SQL and Excel.

The third statement calls the table valued function MarginalValues(). This function returns the values stored in each cell of the model. So, if there are two dimensions and each has ten values, then this returns twenty rows. Of course, because there are more than one value in the table (a string and a value), a new data type is needed to store these values. This data type is called MarginalValueModelElement. The attached files contain the definitions for these functions and types.

A second table valued function is also defined for the type. This function is called AllCells() and it returns all combinations of the cells. So, if there are ten values along two dimensions, this function returns one hundred rows, one for each combination of the two values. This function also shows that it is possible to have more than one table valued function within a given model.

Defining Table Valued Function in T-SQL
Table valued functions have to be declared in T-SQL. The definition is an extension of the definition of scalar valued functions.

The MarginalValues() function returns a specific type, so this needs to be declared. This is simply:

CREATE TYPE MarginalValueModelElement
EXTERNAL NAME ud.MarginalValueModelElement

GO

The function itself uses the code:

CREATE FUNCTION MarginalValues(@arg BasicMarginalValueModel)
RETURNS TABLE (mvme MarginalValueModelElement)
AS EXTERNAL NAME ud.BasicMarginalValueModel.InitMarginalValueEnumerator
GO

First, notice that table valued functions follow use the same keyword as scalar functions. The difference is the use of RETURNS TABLE rather than just RETURNS. After this keyword comes the table definition. Table valued functions can only return tables with one column. I am not sure if this is a limitation of SQL Server or a C# limitation (table valued functions are implemented as enumerators in C#).

Second, notice that the table valued function is actually defined within the type BasicMarginalValueModel. Scalar functions defined within a type do not need explicit declarations; however, table functions do. Although the function is defined within the type, it is defined as static, so it still needs to take the model as an argument. In fact, all user defined function declared explicitly in SQL Server must be static, both scalar and table functions.

Notice that the function definition defines the name of the colum as mvme. In the previous code, this column name is used to access values.

Within SQL Server, scalar functions and table valued functions are stored separately. After loading blog3enum.dll using blog3enum-load.sql (two files mentioned at the top of this post), the following are in SQL Server:


(I apologize for the small size of this image; I do not know how to make it larger.)
Notice that SQL Server has separate areas for scalar functions and table-valued functions. I find this ironic, since the metadata stores them in the same way.

The Primitives for Implementing Them in C#
The C# code for table valued functions is basically the code for user defined enumerators. A user defined enumerator is something that you use for the foreach statement.

There are three steps for creating a user-defined enumerator in C#:
  1. Declare the class to be an instance of System.Collections.IEnumerable.
  2. Declare the two enumeration functions.
  3. Declare the enumeration class that does all the work.
The next three sections discuss these in a bit more detail.

IEnumerable and IEnumerate
Declaring a table valued function requires declaring a user defined enumeration, and this in turn requires using two underlying classes. The distinction between these two classes is a bit subtle and confusing, although the ideas are not really difficult.

The first class is the IEnumerable class. This class says "hey, I'm a class that supports foreach". We need it, because such classes are actually what table-valued functions are. And this makes sense. A table valued function has a bunch of rows that are returned one-by-one. The foreach clause does the same thing in C#.

The second class is IEnumerate, which we will see used below. This class is not a declaration of an external interface. Instead, it is used in the bowels of the foreach. It maintains the state needed to fetch the next value.

I would like to add one more comment about table valued functions. Unlike aggregation functions, they do not seem to support a parallel interface. This is unfortunate, since this limits the scalability of code using them.

Declaring SQL Table Functions
Two functions are needed to define a table valued function. The first is the enumeration function and the second is a helper function that "fills" a row. These two function are defined as follows:

[SqlFunction(FillRowMethodName = "BVMEnumeratorFillRow")]
public static BVMMElementEnumerator
....InitMarginalValueEnumerator (BVMM csm)

{
....return new BVMMElementEnumerator(csm);
}

public static void
....BVMMEnumeratorFillRow (Object row,
........out MarginalValueModelElement evme)

{
....evme = (MarginalValueModelElement)row;
} // BasicMarginalValueModelEnumeratorFillRow()


(In this code, I have used BVMM for BasicMarginalValueModel so the code formats more easily.)

The first of these functions is the reference used in the CREATE FUNCTION statement. This uses a compiler directive, specific for the SQL Server interface. This directive simply says that the function to call to retrieve each row is called BVMEnumeratorFillRow. Not surprisingly, this is the other function.

The first function returns the enumerator. This is a special class that stores state between calls to the enumerator. This is discussed in the next section.

The underlying C# routines that do the enumerations use very general code that works in terms of objects and that has nothing to do with SQL Server. The interface to SQL Server uses the fill-row routine, which simply copies the appropriate values into the row, and this is handled by casting the object to the appropriate type.

Defining the Enumeration Class
The enumeration class is the most complex part of the definition. However, in this case, the code is rather simple, because it accesses an underlying enumerator used for the Dictionary class.

First a word about the class used for the MarginalValues() SQL function. It is called BasicMarginalValueModelElementEnumerator. The connection between the function in SQL and this class is not readily apparent. It requires looking at the C# code that defines the C# fucntion used to define MarginalValues(). This fucntion is called InitMarginalValueEnumerator() and it creates an instance of this enumeration class.

So, the class must be defined to inherit from System.Collections.IEnumerator; this sets it up to have the appropriate interface for an enumeration.

This class contains the following elements:
  • A private member to store the state. This is an instance of the class System.Collections.IEnumerator.
  • A constructor, which assigns the enumerator from the dictionary to the private member.
  • A MoveNext() function that goes to the next element in the list. This simply calls the dictionary function.
  • A Reset() function that starts over at the beginning.
  • A Current member that returns the current value of the enumerator as an object. It is this object that is then copied into the row, using the fill function.
All of these are defined in terms of the enumeration for the Dictionary class, so the code itself is quite simple. Note that everything in this class is set up only for the enumeration and not for SQL code. The class has no SQL-specific compiler directive, or functions like Write() and Read(). It is the fill-row function that takes the value returned by the enumerator and transfers the value into the SQL Server world.

The AllCells enumeration function provide a more complicated example. In this case, the calculations are done explicitly, because there is no underlying type to support the functionality.

Table Valued Functions and Modeling
Table valued functions are a very powerful feature of SQL Server. However, they are ancillary to my goals, which is to understand how to extend the SQL language to support data mining concepts such as modeling.

They do have one very large short-coming, which is the fact that their interface does not support parallel scalability. This is significant, because my choice of SQL is partly due to its scalability. Remember that the user defined aggregation functions include a Merge() method which does support parallelism. There is no corresponding capability for table valued functions.

The preceding three posts have been a detailed exposition on how to incorporate one type of model into SQL Server. The first explained the model; the second explained the C# code, and this, the third, explains user defined functions.

Much of this has been prepatory. The basic marginal value model is more useful as an example than as a modeling tool. The next post is about making the T-SQL load script a bit simpler. It will then be follwed by the description of another type of model. Naive Bayesian model are quite powerful and useful, and actually quite similar to marginal value models.

1 Comments:

Blogger Rahul said...

I found good resources of c#. Check this out

http://csharptalk.com

December 11, 2009 7:54 AM  

Post a Comment

<< Home