Thursday, October 25, 2007

Marginal Value Models: Building and Using Data Mining Models in SQL Server (Part 1)

[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book Data Analysis Using SQL and Excel. The first post is available here.]

A marginal value model is a very simple type of model. However, it gives a good example of how to implement a data mining model using SQL Server extensions. Recall that the model itself produces an expected value, in the same way as the chi-square calculation. This expected value is the model estimate.

The inputs to the model are the dimensions for the estimate, and these are necessarily categorical variables (strings) that take on, preferably, just a handful of values. As a note, the version described here has limits on the total number of values along all dimensions; these limits are imposed by SQL Server and discussed at the end of this post.

This post describes how to use the model. The next post describes how it is implemented. The third post describes some additional useful technical details.

This posting has three files attached:
As explained in the first posting, the first two files are a DLL containing the functionality and a T-SQL script that loads it into the database. The third file contains the code.

BasicMarginalValueModel Type
The marginal value model itself is a data type in SQL Server. The data type BasicMarginalValueModel is implemented as a C# class containing all the information that describes the model as well as various functions, such as:
  • ToString(), which converts the information describing the model to a string.
  • Parse(), which parses a string containing information describing the model.
  • Write(), which is like ToString() except the format is binary instead of character.
  • Read(), which is like Parse() except the format is binary instead of character.
There is a subtle difference between the ToString()/Parse() pair and the Write()/Read() pair. Write() and Read() are used implicitly whenever SQL Server moves values of the type around, so they do get called. ToString() and Parse(), on the other hand, are only used by SQL Server when reading or writing BasicMarginalValueModel values as text. However, ToString() is very handy for seeing what is happening, so it is also called explicitly in queries.

What does the model "look" like? It looks like a collection of key-value pairs. So, if the model contained 50 states and three region types ("urban", "rural", and "mixed"), then an instance of the model would contain up to 53 key-value pairs. The key combines the dimension number (0 for state, 1 for region type) with the dimension value (the particular state or region type). The value is the count (or other total) associated with that key.
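
To make this concrete, here is a rough C# sketch of that representation. This is illustration only, not the actual implementation (which is the subject of the next post); the class and member names here are made up.

// Illustration only: one way to hold the key-value pairs described above.
using System.Collections.Generic;

public class MarginalStoreSketch
{
    // e.g. "0:NY" -> population of New York, "1:rural" -> total rural population
    private Dictionary<string, double> counts = new Dictionary<string, double>();

    public void Add(int dimension, string dimValue, double count)
    {
        string key = dimension.ToString() + ":" + dimValue;
        double current;
        counts.TryGetValue(key, out current);   // current is 0 if the key is new
        counts[key] = current + count;
    }
}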

An Aggregation, Another Type, and a Function
Having a type is useful, but how do we create values of the type? The answer is simple: the CreateBasicMarginalValueModel() aggregation function. This aggregation adds up the counts along all dimensions.

Ideally, the aggregation function could be called as:

SELECT ud.dbo.CreateBasicMarginalValueModel(dim1, dim2, . . ., value)
FROM t

However, this is not possible, because aggregation functions can only take one argument. The data type MarginalValueArgs stores one or more dimensions along with a value. The value would typically be 1; however, it is also possible to create the models on summarized or partially summarized data.


This type has a creation function associated with it, MarginalValueArgs1(), which takes the first dimension and the value. To add more dimensions, the type defines a function AddDim(); the second, third, and subsequent dimensions are added by calling it.


Defining functions in user defined types is highly recommended -- except for the issue of performance. Once you have a value of the type, its functions are immediately accessible; they do not need to be declared separately in SQL Server, because they come automatically with the type. Of course, calling such a function seems to require shuffling the type's data back and forth between SQL Server and the DLL, which reduces performance.
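
To see why the chained syntax works, here is a rough sketch of the shape of such an argument type. The real implementation appears in the next post; the names and members here are illustrative only.

// Illustrative sketch of an arguments type that supports chained AddDim() calls.
using System.Collections.Generic;

public class MarginalValueArgsSketch
{
    private List<string> dims = new List<string>();
    private double value;

    public MarginalValueArgsSketch(string firstDim, double val)
    {
        dims.Add(firstDim);
        value = val;
    }

    // Append one more dimension and return the object itself; returning "this"
    // is what allows ud.dbo.MarginalValueArgs1(state, 1).AddDim(regtype) in SQL.
    public MarginalValueArgsSketch AddDim(string dim)
    {
        dims.Add(dim);
        return this;
    }
}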


What Creating a Model Looks Like
The following code shows one way to create a model using state and region type as two dimensions:

SELECT ud.dbo.CreateBasicMarginalValueModel(arg).ToString()
FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).
.............AddDim(regtype) as arg
......FROM (SELECT zc.*,
...................(CASE WHEN purban = 1 THEN 'urban'
.........................WHEN purban = 0 THEN 'rural'
.........................ELSE 'mixed' END) as regtype
............FROM sqlbook..zipcensus zc) zc) zc

There are three layers of queries. The innermost query defines the region type. The next level defines the inputs into the model creation routine. Notice that the function MarginalValueArgs1() defines the first dimension on state and the AddDim() function defines the second.

Although this is useful for illustration, the model only exists long enough for us to see it using the ToString() function. When the query stops executing, the model is no longer accessible.

The following code assigns the model to a variable. The model can then be referenced in multiple select statements. Note that for this to work, the current database must be "ud", because that is where the data types are defined. Currently, it is not possible to declare variables using data types defined in other databases.

DECLARE @model BasicMarginalValueModel
SET @model =
....(SELECT ud.dbo.CreateBasicMarginalValueModel(arg)
.... FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).
..................... AddDim(regtype) as arg
.......... FROM (SELECT zc.*,
....................... (CASE WHEN purban = 1 THEN 'urban'
............................. WHEN purban = 0 THEN 'rural'
............................. ELSE 'mixed' END) as regtype
................ FROM sqlbook..zipcensus zc) zc) zc

SELECT @model.ToString()

In this case, the @model variable is accessible for the rest of the statements in the batch, but it does not persist. However, because the model is just a value with a complicated type, it could also be stored in a table.

Scoring a Model
The process of scoring is simply applying the model to a given set of values. For instance, the following query scores all the rows in the zc table:

SELECT @model.Score(arg)
FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).
.............AddDim(regtype) as arg
......FROM (SELECT zc.*,
...................(CASE WHEN purban = 1 THEN 'urban'
.........................WHEN purban = 0 THEN 'rural'
.........................ELSE 'mixed' END) as regtype
............FROM sqlbook..zipcensus zc) zc) zc

That is, the dimensions are bundled together into MarginalValueArgs and passed to the model for scoring.

The model can also be used to calculate the chi-squared value (which is probably the most useful thing to do with such a model). This is simply another function in the BasicMarginalValueModel.

SELECT @model.ChiSquared(arg)
FROM (SELECT state, regtype, count(*) as cnt,
.............ud.dbo.MarginalValueArgs1(state, count(*)).
.......................................AddDim(regtype) as arg
......FROM (SELECT zc.*,
...................(CASE WHEN purban = 1 THEN 'urban'
.........................WHEN purban = 0 THEN 'rural'
.........................ELSE 'mixed' END) as regtype
............FROM sqlbook..zipcensus zc) zc
......GROUP BY state, regtype
.....) zc

Limits on the Model

The interface between C# and SQL Server limits the stored representation of the model to 8,000 bytes, which severely restricts the number of dimension values the model can hold. In future postings, I'll suggest an alternative implementation that gets around this limit.

The next posting discusses the C# implementation and the one after that extensions to the model.


Saturday, October 20, 2007

Marginal Value Models: Explanation

This posting describes a very simple type of model used when the target of the model is numeric and all the inputs are categorical variables. This posting explains the concepts behind the models. The next posting has the code associated with them.

I call these models marginal value models. In statistics, the term "marginal" means that we are looking at only one variable at a time. Marginal value models calculate the contribution from each variable, and then combine the results into an expected value.

The chi-square test operates in a similar fashion, but takes the process one step further. The chi-square test compares the actual value to the expected value to determine whether they are sufficiently close to be due to small random variations -- or far enough apart to be suspicious. Both marginal value models and the chi-square test are discussed in more detail in my most recent book Data Analysis Using SQL and Excel. Here the emphasis is a bit different; the focus is on implementing this type of model as an extension to SQL Server.


What are the Marginal Values?

For the purposes of this discussion, the marginal values are the values summarized along one of the dimensions. For instance, if we are interested in the population of different parts of the United States, we might have the population for each state. The following query summarizes this information based on a table of zip code summaries (available on the companion web site to "Data Analysis Using SQL and Excel"):

SELECT state, AVG(medincome), SUM(population)
FROM zipcensus
GROUP BY state

The resulting histogram shows the distribution along this dimension:

The exact values are not shown. What if we also know the population of the urban, rural, and mixed areas of the country? These have the following values:

MIXED 148,595,327
RURAL 27,240,454
URBAN 109,350,778

Given this information about population along two dimensions, how can we combine the information to estimate, say, the rural population of New York?

Those familiar with the chi-square test will recognize this as the question of the expected value. In this situation, the expected value is the total population of the state times the total population of the area category divided by the total population in the United States. That is, it is the row total times the column total divided by the grand total.

For rural Alabama, this results in the following calculation: 4,446,124 * 27,240,454 / 285,186,559 ≈ 424,685. This provides an estimate calculated by combining the information summarized along each dimension.

Is this estimate accurate? That is quite another question. If the two dimensions are statistically independent, then the estimate is quite accurate. If there is an interaction effect, then the estimate is not accurate. However, if all we have are summaries along the dimensions, then this might be the best that we can do.


Combining Values Along More Than Two Marginal Dimensions

The formula for the expected value can be easily extended to multiple dimensions. The idea is to multiply ratios rather than counts. The two-dimension case can be thought of as the product of the following three numbers:

  • The proportion of the population along dimension 1.
  • The proportion of the population along dimension 2.
  • The total population.

That is, we are multiplying proportions (or probabilities, if you prefer). The idea is that the "probability" of being in Alabama is the population of Alabama divided by the population of the country. The "probability" of being rural is the rural population divided by the population of the country. The "probability" of both is the product. To get the count, we multiply the "joint probability" by the population of the country.

This is easily extended to multiple dimensions. The overall "probability" is the product of the "probabilities" along each dimension. To get the count, we then have to multiply by the overall population. Mathematically, the idea is to combine the distributions along each dimension, assuming statistical independence. The term "probability" appears in quotes -- it is almost a philosophical question whether "probabilities" are the same as "proportions", and that is not the subject of this posting.
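
A short C# sketch makes the computation explicit. This is just the formula, not the SQL Server implementation from the other posts:

// Combine marginal totals along any number of dimensions, assuming independence.
// marginalTotals[i] is the total for this cell's value along dimension i;
// grandTotal is the overall total (here, the population of the country).
public static double ExpectedValue(double[] marginalTotals, double grandTotal)
{
    double estimate = grandTotal;            // start from the total count
    foreach (double total in marginalTotals)
        estimate *= total / grandTotal;      // multiply by each "probability"
    return estimate;
}

For rural Alabama, ExpectedValue(new double[] {4446124, 27240454}, 285186559) reproduces the two-dimensional calculation above, roughly 424,685.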

This formulation of the problem is quite similar to naive Bayesian models. The only difference is that here we are working with counts and naive Bayesian models work with ratios. I will return to naive Bayesian models in later postings.


Combining Things That Aren't Counts

Certain things are not counts, but can be treated as counts for the purpose of calculating expected values. The key idea is that the overall totals must be the same (or at least quite close).

For example, the census data contains the proportion of the population that has some college degree. What if we wanted to estimate this proportion for the urban population in New York?
What we need for the marginal value model to work is simply the ability to count things up along the dimensions. In this case, we are tempted to use the proportions directly (since that is the data we have and what the question ultimately asks for).

However, we cannot use proportions, because they do not "add up" to the same total along each dimension. If we take the sum of the proportions in each state, the total will be quite different from the sum of the proportions for urban, rural, and mixed -- if for no other reason than that adding up fifty numbers (or so) is unlikely to produce the same result as adding up three.

Fortunately, there is a simple solution. Multiply the proportion by the appropriate population in each group, to get the number of college-educated people in each group. This number adds up appropriately along each dimension, so we can use it in the formulas described above.

In the end, we get the number of people in, say, rural Alabama who have a college education. We can then divide by the estimate for the population, and arrive at an answer to the question.

This method works with other numbers of interest, such as the average income. The idea would be to multiply the average income by the population to get total dollars. Dollars then add up along the dimensions, and we can calculate the appropriate values for each group.
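
The following sketch pulls the pieces of this section together. The parameter names are illustrative only; the underlying model still works on the counts, as described above:

// Estimate a proportion (say, college education in urban New York) by
// converting proportions to counts, applying the expected-value formula to
// the counts, and converting back to a proportion for the cell.
public static double ExpectedProportion(
    double statePop, double stateProportion,       // one state
    double regtypePop, double regtypeProportion,   // one region type
    double totalPop, double totalProportion)       // whole country
{
    double stateCount = statePop * stateProportion;       // counts add up properly
    double regtypeCount = regtypePop * regtypeProportion;
    double totalCount = totalPop * totalProportion;

    double expectedCount = stateCount * regtypeCount / totalCount;  // people in the cell
    double expectedPop = statePop * regtypePop / totalPop;          // population of the cell
    return expectedCount / expectedPop;
}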

Chi-Square Test

The final topic in this posting is the calculation of the chi-square value, using the marginal value model. The chi-square value is simply:

chi-square value = sum((actual - expected)^2/expected)

The value can be used as a measure of how close the observed data is to the expected values. In other words, it is a measure of how statistically independent the dimensions are. Higher values suggest interdependencies. Values closer to 0 mean that the dimensions are nearly independent.
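
In code, the calculation is just a short loop. The sketch below assumes the actual and expected counts have already been lined up cell by cell:

// Direct translation of the formula above.
public static double ChiSquare(double[] actual, double[] expected)
{
    double total = 0.0;
    for (int i = 0; i < actual.Length; i++)
    {
        double diff = actual[i] - expected[i];
        total += diff * diff / expected[i];
    }
    return total;
}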

This posting describes the background for marginal value models. The next posting describes how to add them into SQL Server.

Sunday, October 14, 2007

Two More Useful Aggregate Functions: MinOf() and MaxOf()

The overall purpose of this blog is to investigate adding data mining functionality into SQL Server (see the first post for a more detailed explanation). We have not yet arrived at adding real data mining functionality, since this requires being comfortable with .NET, C#, and extending SQL Server.

This post offers two more aggregation functions that provide a flavor for how to think about adding analytic capabilities. These functions return the value in one column when the value of another column is at a minimum or maximum. I call the functions MinOf() and MaxOf(). As a brief aside, my most recent book Data Analysis Using SQL and Excel describes various other techniques for getting this information in SQL without adding new functions into the database. Unfortunately, none of the methods is actually elegant.

The attached files contain the source code as well as a DLL and SQL script for loading functionality into the database. These files are:
Note that these files contain all the functionality in the blog1 files as well as the new functionality here. (The earlier post Weighted Average: An Example of Enhancing SQL Server Functionality explains how to load the functionality into the database.)

Thinking About the Problem
A good place to start is to think about what the code would ideally look like. The functions would look like:

SELECT 〈whatever〉, MINOF(〈value〉, 〈min-column〉), MAXOF(〈value〉, 〈max-column〉)
FROM 〈table〉
GROUP BY 〈whatever〉

This construct could be used, for instance, to find the first product purchased by each customer. Or, the most recent amount spent for each customer.

Alas, we cannot extend SQL Server to support such functions, because aggregation functions can only take one argument. This means that we have to add a new type, ValuePair, to bundle the two arguments together. Even more alas, for the function to really be useful, the two elements of ValuePair must be able to hold values of any type (for simplicity, we'll limit this to the built-in basic types). That means we need yet another user defined type, AnyType. I suppose these could be compressed into a single type that holds a pair of any-typed values. However, it is much cleaner to break the code into these pieces.

The result is that the above code instead looks like:

SELECT 〈whatever〉, MINOF(vp).ToDouble(), MAXOF(vp).ToDouble()
FROM (SELECT t.*,
.............ud.dbo.ValuePair(ud.dbo.AnyDouble(〈value〉),
..............................ud.dbo.AnyDateTime(〈min-column〉)) as vp
......FROM 〈table〉 t) t
GROUP BY 〈whatever〉

The variable vp becomes an instance of ValuePair for each row. In this case, it consists of a floating point value (which is the value returned) and a datetime value (which is the column being minimized). Of course, there are "Any" functions for all the built-in types.

What The Solution Looks Like
The solution consists of two user defined types, two aggregation functions, and various support functions:
  • AnyType which represents any SQL type;
  • ValuePair which contains two AnyTypes;
  • MinOf() and MaxOf() aggregation functions; and,
  • Various functions to create instances of AnyType and ValuePair.
The trickiest of these is the AnyType type; the rest are quite simple. For example, ValuePair has just three members:
  • isNull (the null flag);
  • value1 (of type AnyType); and
  • value2 (of type AnyType).
These come with the appropriate member functions for a user defined type. It also has accessors for the two values, called Value1 and Value2.

MinOf() and MaxOf() are aggregation functions. Each contains two private members of type AnyType: the extreme (minimum or maximum) value of the comparison column seen so far, and the value associated with it.
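
The heart of MinOf() is its Accumulate() method. The sketch below shows the idea; the member names (minSoFar, valueAtMin) are illustrative, not necessarily those in the attached source.

// Keep the row whose comparison value (Value2) is the smallest seen so far.
public void Accumulate(ValuePair vp)
{
    // SQL semantics: comparisons involving NULL are false, so NULLs never win
    if (!vp.Value2.IsNull && (minSoFar.IsNull || vp.Value2 < minSoFar))
    {
        minSoFar = vp.Value2;    // smallest comparison value so far
        valueAtMin = vp.Value1;  // the value to return for that row
    }
}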

Adding the AnyType Type
The AnyType type needs to store virtually any type allowed in SQL. Internally, it has a structure with the following members:

private struct union_values
{
....public Byte value_int8;
....public Int16 value_int16;
....public Int32 value_int32;
....public Int64 value_int64;
....public float value_single;
....public double value_double;
....public String value_string;
....public Decimal value_decimal;
....public DateTime value_datetime;
};

This would be better as the equivalent of a C union rather than a C struct, since that would use less space in memory. In any case, the ToString(), Parse(), Write(), and Read() methods only input or output the one value that is actually in use. Another member records which type that is, an enumerated type defined as follows:

private enum datatype
{
....dt_int8,
....dt_int16,
....dt_int32,
....dt_int64,
....dt_single,
....dt_double,
....dt_decimal,
....dt_string,
....dt_datetime
};

Each possible type has a member for returning its value, such as ToTinyInt(), ToSmallInt(), and so on; these are all accessible from the SQL side. Each type also has an overloaded constructor, although the constructors are not accessible from SQL.

Finally, AnyType redefines the "<" and ">" operators. This is needed for the comparisons in MinOf() and MaxOf(), and is complicated by the fact that the two arguments can be of any type. The comparisons follow the rules of SQL: if either value is NULL, then the comparison returns false. Only compatible types can be compared to each other, so an int8 can be compared to a double, but not to a character string or a datetime.
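
A sketch of the "<" operator shows the shape of the logic. The helper names (IsNumeric(), ToDouble(), and so on) are assumptions for illustration; the attached source is the real reference.

// Simplified comparison following SQL rules.
public static bool operator <(AnyType a, AnyType b)
{
    if (a.IsNull || b.IsNull)
        return false;                          // NULL compares as false

    if (a.IsNumeric() && b.IsNumeric())        // e.g. int8 versus double
        return a.ToDouble() < b.ToDouble();

    // presumably like-typed strings and datetimes also compare with each
    // other, since the MinOf()/MaxOf() example above minimizes a datetime
    if (a.IsDateTime() && b.IsDateTime())
        return a.ToDateTime() < b.ToDateTime();

    return false;                              // incompatible types
}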

Creation Functions for AnyType and ValuePair
The following creation functions take a value of a particular type and return an AnyType:
  • AnyTinyInt()
  • AnySmallInt()
  • AnyInt()
  • AnyBigInt()
  • AnyReal()
  • AnyDouble()
  • AnyDecimal()
  • AnyDateTime()
  • AnyString()
Adding an additional value type is quite simple. The following need to be modified:
  • The union_values struct in AnyType needs to store the new type.
  • A new constructor needs to be added for the new value.
  • A new conversion function (a To...() member) needs to be added.
  • ToString(), Parse(), Write(), and Read() need to be modified.
  • The ">" and "<" operators need to be modified.
ValuePair, in turn, has a creation function that takes two arguments of AnyType.
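
The creation functions themselves follow the same pattern as CreateWeightedValue() from the earlier weighted-average post. Here is a sketch of AnyDouble(), with the AnyType constructor and Null member assumed:

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static AnyType AnyDouble(SqlDouble val)
    {
        // NULL in, NULL out
        if (val.IsNull)
            return AnyType.Null;
        // otherwise use the overloaded constructor for doubles
        return new AnyType(val.Value);
    }
}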

The next post moves in a different direction, by talking about a particular type of data mining model, the marginal value model. That first post discusses how the model works rather than how it is implemented.

Tuesday, October 9, 2007

Weighted Average Continued: C# Code

The previous post described how to load the function WAVG() into SQL Server. This post describes the code that generates the DLL.

This discussion assumes that the reader is familiar with C# or similar object oriented languages, such as C++ or Java. That said, the code itself is probably readable by most people who are familiar with object-oriented programming practices.

This discussion is composed of five parts:
  • Very basic discussion of Microsoft Visual Studio;
  • Overview of the code and auxiliary modules;
  • Code for Adding CreateWeightedValue() Function;
  • Code for Adding WAvg() Aggregation Function; and,
  • Code for Adding WeightedValue Type.

The last three specifically describe code. These are ordered by difficulty, since it is easiest to add a user defined function, then an aggregation, and then a type (at least in terms of the volume of code produced). The code containing these is available here.

Overview of Microsoft Visual Studio
Microsoft Visual Studio is the application used to develop C# code (as well as code in other languages) using the .NET framework.

Visual Studio divides work into units called projects. These consist of one or more sets of files containing programming code, and they produce something. This something could be many things:
  • A windows application;
  • A ".exe" file executed from the command line;
  • A library to be shared among other projects;
  • A dynamic load library (DLL);
  • A device driver;
  • and so on.
The thing that we want to create is a DLL, since this can be loaded as an assembly into SQL Server.

For the purposes of this example, I have created a new project called blog in the directory c:\gordon\c-sharp\UserDefinedFunctions. The screen looks like:

Creating such a project automatically opens a code file. The source code for WAvg() can be copied and placed into this file. After the code is in place, the DLL is built by going to the Build-->Build blog menu option. Any errors appear at the bottom of the screen. Visual Studio does a good job of catching errors during the compilation process.

Once the project has been built, the DLL is conveniently located in the path \blog\blog\bin\debug\blog.dll. It can be loaded into SQL Server from this location, copied to a more convenient location, and even emailed or moved onto another computer.

Obviously, there is much more to say about Visual Studio. For that, I recommend Microsoft documentation or simply playing with the tool.

Overview of Code and Auxiliary Modules
Converting C# code into a DLL that can be loaded into SQL Server is a tricky process. In particular, .NET has to be sure that streams of bits represent what they are supposed to represent in both C# and SQL Server. This is a real issue: for instance, database values can be NULL by default, and yet NULL is not part of the native C# numeric types. To support compatibility between the systems, the code includes various using clauses and compiler directives.

However, the bulk of the C# code for this project consists primarily of three class definitions. The class WeightedValue defines the type weighted value, which holds a numeric value and a numeric weight (as C# doubles). The class WAvg defines the aggregation function. Finally, the CreateWeightedValue() function is a member of another class, UserDefinedFunctions. Note that the names of the first two classes match the names of the type and aggregation function respectively. The name of the third class is arbitrary, but carefully chosen to convey the notion that it contains user defined functions.

The beginning of the C# library consists of a series of "using" statements. These specify additional modules used by the code, and are similar to the "#include" preprocessor directive in C and C++ code. For instance, this code has the following references:

using System;
using System.IO;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

The first two specify various system classes that are commonly used. The last specifies classes used specifically for interfacing to SQL Server.

The third is the most interesting, because it defines the classes that contain data going between SQL Server and C#. These are the SQL data types. For instance, FLOAT in SQL corresponds to SqlDouble in C#. Basically, the C# classes wrap the corresponding basic type together with a NULL flag.

However, there are some subtleties when passing data back and forth. "CHAR()" is not supported, although "NCHAR()" is. Fortunately, SQL Server automatically converts between these types.

More insidious is the fact that the lengths of strings and the precisions of numerics, decimals, and money all have to be specified. As a result, I have not figured out how to create a function that takes arbitrary numeric values; user defined functions can only take numerics of a given precision. Of course, we could define our own numeric value that never overflows. More typically, though, we simply declare functions to take a FLOAT. This is sufficient for most purposes and gets passed to C# as SqlDouble. For characters, we define them to take some long character value, such as NVARCHAR(2000), which is converted to SqlString.

More complete matching tables are available in Microsoft documentation: http://msdn2.microsoft.com/en-us/library/system.data.sqltypes(vs.71).aspx.

In addition to the using statements, there are also compiler directives. These are applied to classes and to members of classes, as we will see below.

Code for Adding CreateWeightedValue() Function
The following code provides the full definition for the CreateWeightedValue() function.
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static WeightedValue
    CreateWeightedValue (SqlDouble val, SqlDouble wgt)
    {
        // a NULL value or a NULL weight produces a NULL WeightedValue
        if (val.IsNull || wgt.IsNull)
        {
            return WeightedValue.Null;
        }
        return new WeightedValue(val, wgt);
    } // CreateWeightedValue()
} // UserDefinedFunctions

This code defines a class called UserDefinedFunctions. The partial keyword simply means that the class definition may be split over several source code files. In this case, it is not actually split.

The function itself is a static member of this class, so it can be called without an instance of the class being created -- which is necessary, because it is going to be called from SQL Server. The function starts with a compiler directive that specifies that it is, in fact, a SQL Server function.

The remainder of the function declaration specifies the arguments and the return type. The code in the body is quite simple: if either argument is NULL, the function returns the NULL WeightedValue; otherwise, it constructs a new WeightedValue from the value and the weight.

Recall from the last posting that this function is added into SQL Server using the following code:

CREATE FUNCTION CreateWeightedValue (@val float, @wgt float)
RETURNS WeightedValue as
EXTERNAL NAME ud.UserDefinedFunctions.CreateWeightedValue

This shows the correspondence between the SQL Server and C# language elements.

Code for Adding WAvg() Aggregation Function
The code for an aggregation is more complicated than the code for a single function, so the entire code is not included here. An aggregation class has the following structure:
[Serializable]
[SqlUserDefinedAggregate(Format.UserDefined)]
public class WAvg : IBinarySerialize
{
private double sum;
private double cnt;

public void Init ()
{ . . . }

public void Accumulate (WeightedValue value)
{ . . . }

public void Merge (WAvg other)
{ . . . }

public SqlDouble Terminate ()
{ . . . }

public void Write (BinaryWriter w)
{ . . . }

public void Read (BinaryReader r)
{ . . . }
} // WAvg

The way this function works is quite simple. It maintains a running sum (of weight times value) and a running sum of the weights. When finished, it returns the sum divided by the sum of the weights. Notice that an aggregation function is really a class that contains data, along with member functions that operate on that data.

More interesting is the definition itself. First, the class has two compiler directives. The first is [Serializable]. This directive means that the data in the class can be passed back and forth between SQL Server and C#. In particular, it means that there are two special functions that are going to be defined, Write() and Read(). These are never called explicitly, but are part of the interface. These functions "write" the data to memory and then "read" it back . . . the data is written in one process (SQL Server or C#) and then read back in the other.

The second compiler directive specifies that the class is for an aggregation function. Because the type has "complicated" data, C# does not know how to read and write it automatically. This is true of almost all data types, so the format is typically UserDefined. Other options are available as well, and they are explained in Microsoft documentation.

The class itself implements IBinarySerialize. This is also part of the serialization machinery. Once this interface is declared, the Write() and Read() functions must be defined, or else there is a compiler error.

The other four functions in the interface are actually useful for doing the aggregation. They are not called explicitly, but are used by SQL Server to do the work. The first is Init(), which initializes the values in the class to start a new aggregation. In this case, it sets the sum and the weight to zero.

The function Accumulate() adds in another value. Unfortunately, the accumulation function can only take one argument, which is why we need to create a special type that contains two values. In this case, it simply increments the sum and weight values.

The third function, Merge(), is probably the least obvious of the four functions. This function merges two aggregation values. Why would this ever happen? The reason is parallelism. SQL Server might separate the aggregation into multiple threads for performance reasons, and Merge() brings together the intermediate results. One super nice thing about this structure is that we get the benefits of parallel, multi-threaded performance without really having to think about it. A very nice thing indeed.

The final function Terminate() is the most harshly named of the four. It returns the final value, in this case as a SQL floating point value (which is equivalent to a C# double).
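
The bodies omitted above amount to only a few lines each. Here is a sketch of what they might look like; the attached source is authoritative, and details such as how NULL inputs are handled are assumptions here.

public void Init()
{
    sum = 0.0;
    cnt = 0.0;
}

public void Accumulate(WeightedValue value)
{
    if (value.IsNull) return;           // assume NULLs are skipped, as in AVG()
    sum += value.Value * value.Weight;  // running sum of weighted values
    cnt += value.Weight;                // running sum of weights
}

public void Merge(WAvg other)
{
    // combine the partial results computed by another thread
    sum += other.sum;
    cnt += other.cnt;
}

public SqlDouble Terminate()
{
    if (cnt == 0.0) return SqlDouble.Null;  // nothing (with weight) accumulated
    return new SqlDouble(sum / cnt);
}

public void Write(BinaryWriter w)
{
    w.Write(sum);
    w.Write(cnt);
}

public void Read(BinaryReader r)
{
    sum = r.ReadDouble();
    cnt = r.ReadDouble();
}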


Code for Adding WeightedValue Type
The final section is for adding in the type. This is similar to the aggregation class, although slightly different.

The following shows the various functions in the user defined type.

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.UserDefined)]
public class WeightedValue : INullable, IBinarySerialize
{
private bool isNull;
private double value;
private double weight;

public WeightedValue ()
{ . . . }

public bool IsNull
{ . . . }

public static WeightedValue Null
{ . . . }

public override string ToString ()
{ . . . }

public static WeightedValue Parse (SqlString s)
{ . . . }

public void Write (BinaryWriter w)
{ . . . }

public void Read (BinaryReader r)
{ . . . }

public WeightedValue (SqlDouble val, SqlDouble wgt)
{ . . . }

public double Value
{ . . . }

public double Weight
{ . . . }

} // WeightedValue

Although the aggregation function and user defined type are both defined as classes, they do have some fundamental differences. The biggest one is that the type itself is visible inside SQL Server. This means that its various methods are all callable from SQL -- even methods not required by the API. This can be a handy way of adding functionality, without bothering to create new user defined functions.

Certain functions are required for a user defined type. First, a constructor is needed that takes no arguments. This usually produces the NULL value. Note that C# does not require destructors.

In addition, Null and IsNull are also required. Their absence is caught by the compiler, because the user defined type implements INullable.

The functions Write() and Read() are also required. These behave the same way as for the aggregation function.

Finally, the functions ToString() and Parse() are required. These are very similar to Write() and Read(), except that they write the values out to character strings. These strings are intended to be read and written by human beings, preferably in a somewhat understandable format. The ToString() function is particularly useful because it can be used in a query to see what is stored in a value of the type.

This user defined type makes all its elements private, and creates separate accessors for each one (these are called Weight and Value). This is not a requirement. However, it can be good programming practice for protecting members of a class.
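
As with the aggregation, the omitted bodies are short. Here is a sketch of ToString() and Parse(), assuming a simple "value,weight" string format (the attached source may format the string differently):

public override string ToString()
{
    if (isNull) return "NULL";
    return value.ToString() + "," + weight.ToString();
}

public static WeightedValue Parse(SqlString s)
{
    if (s.IsNull) return WeightedValue.Null;
    string[] parts = s.Value.Split(',');
    return new WeightedValue(new SqlDouble(double.Parse(parts[0])),
                             new SqlDouble(double.Parse(parts[1])));
}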

The next post extends these ideas for another useful aggregation function, one that returns the value of one column when the value in another is minimized.