Sunday, October 14, 2007

Two More Useful Aggregate Functions: MinOf() and MaxOf()

The overall purpose of this blog is to investigate adding data mining functionality into SQL Server (see the first post for a more detailed explanation). We have not yet arrived at adding real data mining functionality, since this requires being comfortable with .NET, C#, and extending SQL Server.

This post offers two more aggregation functions that provide a flavor for how to think about adding analytic capabilities. These functions return the value in one column when the value of another column is at a minimum or maximum. I call the functions MinOf() and MaxOf(). As a brief aside, my most recent book Data Analysis Using SQL and Excel describes various other techniques for getting this information in SQL without adding new functions into the database. Unfortunately, none of the methods is actually elegant.

The attached files contain the source code as well as a DLL and SQL script for loading functionality into the database. These files are:
Note that these files contain all the functionality in the blog1 files as well as the new functionality here. (The earlier post Weighted Average: An Example of Enhancing SQL Server Functionality explains how to load the functionality into the database.)

Thinking About the Problem
A good place to start is to think about what the code would ideally look like. The functions would look like:

SELECT 〈whatever〉, MINOF(〈value〉, 〈min-column〉), MAXOF(〈value〉, 〈max-column〉)
FROM 〈table〉
GROUP BY 〈whatever〉

This construct could be used, for instance, to find the first product purchased by each customer. Or, the most recent amount spent for each customer.

Alas, we cannot extend SQL server to support such functions, because aggregation functions can only take one argument. This means that we have to add a new type ValuePair to handle the two arguments. But even more alas, the two elements of ValuePair can be of any type for the function to really be useful (for simplicity, we'll limit it to any built-in basic type). That means that we need yet another user defined type, AnyType. I suppose these could be compressed into a single type that took pairs of anytype. However, it is much cleaner to break the code into these pieces.

The result is that the above code instead looks like:

SELECT 〈whatever〉, MINOF(vp).ToDouble(), MAXOF(vp).ToDouble()
..............................ud.dbo.AnyDateTime(〈min-column〉)) as vp
......FROM 〈table〉 t) t
GROUP BY 〈whatever〉

The variable vp becomes an instance of ValuePair for each row. In this case, it consists of floating point value (which is the value returned) and a date time column. Of course, there are "Any" functions for all the built-in types.

What The Solution Looks Like
The solution consts of two built-in types, two aggregation functions, and various support functions:
  • AnyType which represents any SQL type;
  • ValuePair which contains two AnyTypes;
  • MinOf() and MaxOf() aggregation functions; and,
  • Various functions to create instances of AnyType and ValuePair.
The trickiest of these is the AnyType type. The remainder are quite simple, so ValuePair has three members:
  • isNull (the null flag);
  • value1 (of type AnyType); and
  • value2 (of type AnyType).
With the appropriate member fucntions for a user defined type. It also has methods for accessing the two values, called Value1 and Value2.

MinOf() and MaxOf() are aggregation functions. Each contains two private members of type AnyType, the minimum value and the minimum variable.

Adding the AnyType Type
The AnyType type needs to store virtually any type allowed in SQL. Internally, it has a structure with the following members:

private struct union_values
....public Byte value_int8;
....public Int16 value_int16;
....public Int32 value_int32;
....public Int64 value_int64;
....public float value_single;
....public double value_double;
....public String value_string;
....public Decimal value_decimal;
....public DateTime value_datetime;

This would be better as the equivalent of a C union rather than a C struct, since that would use less space in memory. However, for the ToString(), Parse(), Write() and Read() methods, only the one actual value is input or output. Another member is an enumerated type defined as follows:

private enum datatype

Each possible types has a member for returning a particular value, such as ToTinyInt(), ToSmallInt(), and so on. These are all accessible from the SQL-side. Each type also has an overloaded constructor. The constructor is not accessible from SQL.

Finally, AllType redefines the "<" and ">" operators. This is needed for the comparisons for MinOf() and MaxOf(). These are complicated by the fact that the two arguments can be of any type. The comparisons follow the rules of SQL, so if either value is NULL then the comparisons return false. Only numerics can be compared to each other, so int8 can be compared to double but not to a character string or datetime.

Creation Functions for AnyType and ValuePair
The following creation functions take a value of a particular type and return an AnyType:
  • AnyTinyInt()
  • AnySmallInt()
  • AnyInt()
  • AnyBigInt()
  • AnyReal()
  • AnyDouble()
  • AnyDecimal()
  • AnyDateTime()
  • AnyString()
Adding an additional value type is quite simple. The following need to be modified:
  • The union_value struct in AnyType needs to store the new type.
  • A new constructor needs to be added for the new value.
  • A new conversion function (To).
  • Modifications to ToString(), Parse(), Write(), and Read().
  • Modify the ">" and "<" operators.
ValuePair, in turn, has a creation function that takes two arguments of AnyType.

The next post moves in a different direction, by talking about a particular type of data mining model, the marginal value model. The first post discusses how the model works rather than how it is implemented.


Post a Comment

<< Home