Sunday, September 30, 2007

Weighted Average: An Example of Enhancing SQL Server Functionality

This entry discusses how to enhance SQL Server functionality by adding in a simple aggregation function, the weighted average. Ideally, we would want to use the function as follows:

SELECT AVG(purban), WAVG(purban, population)
FROM zipcensus zc

(The zipcensus table contains summaries of census information by zip code and is available at the companion page for Data Analysis Using SQL and Excel.)

This contains two unfortunatelies. First, SQL does not contain a weighted average function, so we have to define one ourselves. Second, aggregation functions cannot take two arguments, so we have to work around this. The call looks like this in practice:
SELECT AVG(purban), ud.dbo.WAVG(wval)
FROM (SELECT zc.*,
ud.dbo.CreateWeightedValue(purban, population) as wval
FROM zipcensus zc) zc

This posting references two files:
The second is the code for the new objects. The first loads these into the database.

How Wavg is Implemented
The Wavg() aggregation function is implemented as a user-defined aggregate in blog1.dll. As we'll see in the next posting, an aggregation is really a class in C# (or other .NET language) that has certain specific characteristics.

The weighted average requires two values -- the value and the weight -- to do the calculation. Unfortunately, SQL Server does not allow aggregations to take more than one argument, even user defined aggregations. Fortunately, we can define a new type, WeightedValue, that contains these two elements. This type define the following methods (attributes and functions):
  • Null (an attribute) is the NULL value for the type.
  • Value (an attribute) returns the value component.
  • Weight (an attribute) returns the weight component.
  • ToString() (a function) converts it to a string.
  • Parse() (a function) converts the string back into the type.
Note that once the type is defined, it can be used as a colum in a table or as a T-SQL variable.

The type itself does not define a function that creates the type. For this, we need a user defined function, CreateWeightedValue(). This function takes two arguments, which are both FLOATs -- one for the value and one for the weight. SQL Server converts any numeric argument to a FLOAT, so this function can take integers, floats, numerics, and so on as arguments.


Assemblies, Functions, Types and All That Jazz
SQL Server loads DLLs (dynamic link libraries) into an object called an assembly. The code for the DLL is actually placed in the database. Fortunately, the .NET architecture makes it possible to load the same code, even if the code is written on a different machine from the one where it is being run. The DLL code is not actually machine code, but a low-level language that is quickly recompiled as it is run.

An assembly can contain objects such as the following:
  • User defined functions (such as CreateWeightedValue());
  • User defined types (such as WeightedValue());
  • User defined aggregates (such as Wavg());
  • User defined procedures; and,
  • User defined table functions.
User defined types and aggregates are defined as classes in C#. User defined functions and procedures are defined as methods within a single class. For convenience, all user defined functions are defined in a class called UserDefinedFunctions.

Loading an Assembly The First Time

Although an assembly contains definitions for the objects in it, creating an assembly does not load the objects. The following are the steps for loading an assembly:

1) Create the assembly using the CREATE ASSEMBLY statement:

CREATE ASSEMBLY ud
FROM 'c:\\gordon\blog\blog1.dll'
WITH PERMISSION_SET = SAFE;

This loads the DLL stored in the location "c:\\gordon\blog\blog1.dll" into the database.

The clause "WITH PERMISSION_SET = SAFE" simply says that the DLL is very well-behaved and should not introduce any problems. This is good, because you need special admistrator priviledges to load unsafe code into the database. And, you cannot fool SQL Server (at least easily). It validates the code to see whether any unsafe features (such as unions or pointers or file i/o) are being used.

(2) Define the objects in the DLL:

CREATE TYPE WeightedValue
EXTERNAL NAME ud.WeightedValue;

GO

CREATE FUNCTION CreateWeightedValue (@val float, @wgt float)
RETURNS WeightedValue as
EXTERNAL NAME ud.UserDefinedFunctions.CreateWeightedValue
GO

CREATE AGGREGATE WAvg (@val WeightedValue)
RETURNS float
EXTERNAL NAME ud.WAvg
GO

All of these use the EXTERNAL NAME option for the CREATE command. This "links" the user defined function in SQL to the code that defines it. The external name is of the form . (for types and aggregates) or .. (for functions).


Loading an Assembly Again
Unfortunately, once an assembly is loaded and the objects defined, the following command returns an error:

DROP ASSEMBLY ud

The error is:

DROP ASSEMBLY failed because 'ud' is referenced by object 'CreateWeightedValue'.

Oops. SQL Server is smart enough to know that objects in the database rely on the assembly. Alas, it is not smart enough to drop them. For that, we need code such as the following:

IF OBJECT_ID('Wavg') IS NOT NULL
DROP AGGREGATE Wavg
GO

IF OBJECT_ID('CreateWeightedValue') IS NOT NULL
DROP FUNCTION CreateWeightedValue
GO

IF (SELECT COUNT(*) FROM sys.types WHERE UPPER(name) = 'WEIGHTEDVALUE') > 0
DROP TYPE WeightedValue
GO

IF (SELECT COUNT(*) FROM sys.assemblies WHERE UPPER(name) = 'UD') > 0
DROP ASSEMBLY ud
GO

This code checks to see if each object exists before deleting the object. This prevents errors from occuring. Notice that the TYPE is dropped after the AGGREGATE and FUNCTION. This is because the TYPE is used by them, and cannot be dropped while they reference it. All this code is in blog1-load.sql.


Using Wavg()

As mentioned at the beginning of the post, the right way to use this function is as follows:


SELECT AVG(purban), ud.dbo.WAVG(wval)
FROM (SELECT zc.*,
ud.dbo.CreateWeightedValue(purban, population) as wval
FROM zipcensus zc) zc

The innermost subquery reads from the table zipcensus, returning all the values in the row, plus a new column containing the weighted value. This is passed to the outermost query and used as the argument to the WAVG function.

By the way, the function can be used on any numeric type (integer, float, numeric, and so on), because SQL Server will convert all numeric values to FLOAT (the type of the arguments to CreateWeightedValue()).

We can verify that weighted value works, using code such as:

SELECT wval.Value, wval.Weight, wval.ToString()
FROM (SELECT ud.dbo.CreateWeightedValue(3, 4.5) as wval) a

This returns the values placed into the function.

Although this example does not require the functionality, user defined types can be placed in tables and used as variables in T-SQL code.


Friday, September 28, 2007

.NET Architecture for Extending SQL Server

I have chosen the .NET architecture for this exploration primarily because it is convenient -- I have ready access to it. It is worth pointing out that many of the ideas are independent of .NET and should be implementable in any database that has comparable facilities for extending functionality.

This post is divided into three sections: the good things about .NET; the bad things about .NET, and a brief overview of .NET archicture for my purposes.

Advantages of .NET
.NET is the Microsoft architecture for developing multi-faceted applications. This is what makes it possible to extend the database to include data mining and statistics functionality. .NET includes SQL Server, components of the operating system, and various programming languages (C# is the primary one, although it also supports Java and C++).

The biggest strength of .NET is that it supports much of what I want to do. It enables extending SQL Server with my own data types, functions, procedures, and aggregations. In addition, SQL Server is a very reasonable database for approaching analysis of moderately large sized data sets.

One big advantage of .NET is that it is single source. That is, all components are developed by Microsoft, which provides evidence that they can work together. This is a big advantage, because I want to think about databases and functionality, rather than APIs (application programming interfaces), version compatibility, installing multiple applications, and so on.

A second advantage of .NET is that it is readily available with the SQL Server database. It has a standard database platform that it interacts with.

The C# language provided with .NET is a powerful, object-oriented language. I view it as a dialect of C++, probably because I've forgotten many of the nuances of C++. The syntax is similar, although .NET provides many more built-in classes.

A fourth advantage is that .NET makes it possible to distribute both the source code and the compiled code (as dynamic load libraries better known as DLLs) regardless of the platform. Of course, .NET only runs on Microsoft operating systems. For true operating system independence, I would need to use a platform such as Java.

Less important but still notable is the fact that it has a consistent documentation interface (both on the web and with the product) and has many books written about it. For my purposes, I have extensively used C# 2005 (published by Wrox and available here) as well as the online and product documentation.

Disadvantages of .NET
.NET is sufficiently powerful for extending the functionality of a database to include data mining and statistical functionality. However, it does have some challenges.

Of course, this approach only works with SQL Server on Microsoft platforms. It is not a general purpose solution, although I do think the ideas should be readily extendible to other platforms.

In addition, there are multiple technical issues that I need to work around:

  • The maximum size of a user defined data type is 8000 bytes.
  • SQL Server does not allow overloading of function names. (That is, I cannot define two functions with the same name where one takes an integer argument and one takes a character argument; I have to give the two functions different names.)
  • SQL Server does not allow updating or modifying the database within a user defined function.
  • SQL Server does not allow aggregations to take more than one argument.
  • SQL Server does not allow functions or aggregations to take variable numbers of arguments.
  • C# has quite a few compiler directives which are needed for building libraries for SQL Server.
  • In order to be loaded as "safe" code into the database, C# is not allowed to use unions, pointers, modify the database, or do file I/O.
  • Loading assemblies (DLLs) into SQL Server is a multi-part process. The assembly is loaded and then individual elements have to be defined (although this is automated using Microsoft Visual Studio).
  • C# is not optimized for processing performance. That said, the performance seems quite reasonable to me.
  • Function calls in SQL Server seem to require a three part name . . . ... This is a minor irritation.
In many cases, there are ways around these limitations.

Overview of .NET
From my perspective, there are four components to the .NET framework:

  • SQL Server;

  • SQL Server Management Studio (for accessing the database);

  • C#; and,

  • Visual Studio (for writing C# code).
Ironically, though, the most important part is the hidden fifth component that ties these together, the .NET framework. The following picture captures my understanding of the relationships between these components. In actual fact, both SQL Server Management Studio and Visual Studio talk through the .NET interface. However, I am concerned with the interface between the database and C# for adding new functionality into the database.


Note that there are two ways that C# code can interact with the database. A C# application has full abilities to SELECT, UPDATE, and INSERT items into the database, as well as changing database parameters, creating new databases, and so on (so long as the user has the appropriate rights). Much database programming is done through such applications (which can also be written in other languagates). That is not what is happening here.

The second method is that C# can extend the functionality of the database through the creation of user defined types, functions, procedures, and aggregations. This is what I am interested in. Alas, we are not able to change the syntax of SQL. However, these elements are great building blocks for what I want to do.


Extending the functionality is simply a matter of doing the following:

  1. Creating a project in Microsoft Visual Studio to produce a DLL.

  2. Load the DLL into SQL Server as an assembly.

  3. Define the elements in the assembly.

(These last two steps can be automated when your version of Visual Studio supports SQL Server assemblies.)

The project in Visual Studio is simply a set of C# classes, with a few well placed compiler directives. Steps (2) and (3) can be done automatically, but I have a script that does the work.

The next post assumes that the Visual Studio and SQL Server environment is available and describes how to add some useful functions.

Wednesday, September 26, 2007

Extending SQL Server to Support Some Statistical and Data Mining Functionality


My most recent book, Data Mining Using SQL and Excel (order here), is about combining the power of databases and Excel for data analysis purposes. From working on that book, I have come to feel that SQL and data mining are natural allies, since both are about making sense of large amounts of data.

A surprising observation (at least to me) is that SQL operations are analogous to data mining operations. In many ways, aggregating data -- summarizing it along dimensions -- is similar to building models, since both are about capturing underlying structure in the data. And, in some cases, joining tables is similar to scoring models, since joining takes information from one row and "adds in" new information.

This idea has intrigued me since finishing the final draft. So, I decided to embark on an adventure. This adventure is to extend SQL functionality to include various types of models. My goal is to make data mining functionality a natural part of using SQL. Okay, that is a bit ambitious, because any SQL extension tends to look "grafted" onto the basic language. However, it is possible to add the concept of a "statistical model" to SQL and see where that goes.

The purpose of this blog is to capture the interesting ideas that I learn and put them in one place. I have already learned a lot about SQL, statistics, C#, and .NET programming by starting this endeavor. In addition, I also want to make the code available to other people who might find it useful.

For various reasons that I discuss in my first technical post, I have decided to implement this scenario using .NET (that is, C# and Microsoft SQL Server). By the way, this is not because of a great love for Microsoft development environments; I have very painful memories of trying to use very buggy release versions of Microsoft Visual C++ in the late 1980s. I am learning this environment "as I go", since I had never programmed in C# before April of this year.

I already have some ideas for upcoming posts:
  • Introduction to .NET for Extending SQL Server
  • Adding A Useful Function: Weighted Averages
  • Two More Useful Functions: MinOF and MaxOF
  • What is a Marginal Value Model?
  • Implementing A Basic Marginal Value Model
  • What is a Linear Regression Model?
  • Implementing A Linear Regression Model
  • Model Management and the Marginal Value Model
  • What is a Naive Bayesian Model?
  • Implementing a Naive Bayesian Model
  • What is a Survival Model?
  • Implementing a Survival Model
I do not have a schedule in mind, but this is an adventure and I'm very curious where it will lead.