Sunday, September 30, 2007

Weighted Average: An Example of Enhancing SQL Server Functionality

This entry discusses how to enhance SQL Server functionality by adding in a simple aggregation function, the weighted average. Ideally, we would want to use the function as follows:

SELECT AVG(purban), WAVG(purban, population)
FROM zipcensus zc

(The zipcensus table contains summaries of census information by zip code and is available at the companion page for Data Analysis Using SQL and Excel.)

This contains two unfortunatelies. First, SQL does not contain a weighted average function, so we have to define one ourselves. Second, aggregation functions cannot take two arguments, so we have to work around this. The call looks like this in practice:
SELECT AVG(purban), ud.dbo.WAVG(wval)
FROM (SELECT zc.*,
ud.dbo.CreateWeightedValue(purban, population) as wval
FROM zipcensus zc) zc

This posting references two files:
The second is the code for the new objects. The first loads these into the database.

How Wavg is Implemented
The Wavg() aggregation function is implemented as a user-defined aggregate in blog1.dll. As we'll see in the next posting, an aggregation is really a class in C# (or other .NET language) that has certain specific characteristics.

The weighted average requires two values -- the value and the weight -- to do the calculation. Unfortunately, SQL Server does not allow aggregations to take more than one argument, even user defined aggregations. Fortunately, we can define a new type, WeightedValue, that contains these two elements. This type define the following methods (attributes and functions):
  • Null (an attribute) is the NULL value for the type.
  • Value (an attribute) returns the value component.
  • Weight (an attribute) returns the weight component.
  • ToString() (a function) converts it to a string.
  • Parse() (a function) converts the string back into the type.
Note that once the type is defined, it can be used as a colum in a table or as a T-SQL variable.

The type itself does not define a function that creates the type. For this, we need a user defined function, CreateWeightedValue(). This function takes two arguments, which are both FLOATs -- one for the value and one for the weight. SQL Server converts any numeric argument to a FLOAT, so this function can take integers, floats, numerics, and so on as arguments.


Assemblies, Functions, Types and All That Jazz
SQL Server loads DLLs (dynamic link libraries) into an object called an assembly. The code for the DLL is actually placed in the database. Fortunately, the .NET architecture makes it possible to load the same code, even if the code is written on a different machine from the one where it is being run. The DLL code is not actually machine code, but a low-level language that is quickly recompiled as it is run.

An assembly can contain objects such as the following:
  • User defined functions (such as CreateWeightedValue());
  • User defined types (such as WeightedValue());
  • User defined aggregates (such as Wavg());
  • User defined procedures; and,
  • User defined table functions.
User defined types and aggregates are defined as classes in C#. User defined functions and procedures are defined as methods within a single class. For convenience, all user defined functions are defined in a class called UserDefinedFunctions.

Loading an Assembly The First Time

Although an assembly contains definitions for the objects in it, creating an assembly does not load the objects. The following are the steps for loading an assembly:

1) Create the assembly using the CREATE ASSEMBLY statement:

CREATE ASSEMBLY ud
FROM 'c:\\gordon\blog\blog1.dll'
WITH PERMISSION_SET = SAFE;

This loads the DLL stored in the location "c:\\gordon\blog\blog1.dll" into the database.

The clause "WITH PERMISSION_SET = SAFE" simply says that the DLL is very well-behaved and should not introduce any problems. This is good, because you need special admistrator priviledges to load unsafe code into the database. And, you cannot fool SQL Server (at least easily). It validates the code to see whether any unsafe features (such as unions or pointers or file i/o) are being used.

(2) Define the objects in the DLL:

CREATE TYPE WeightedValue
EXTERNAL NAME ud.WeightedValue;

GO

CREATE FUNCTION CreateWeightedValue (@val float, @wgt float)
RETURNS WeightedValue as
EXTERNAL NAME ud.UserDefinedFunctions.CreateWeightedValue
GO

CREATE AGGREGATE WAvg (@val WeightedValue)
RETURNS float
EXTERNAL NAME ud.WAvg
GO

All of these use the EXTERNAL NAME option for the CREATE command. This "links" the user defined function in SQL to the code that defines it. The external name is of the form . (for types and aggregates) or .. (for functions).


Loading an Assembly Again
Unfortunately, once an assembly is loaded and the objects defined, the following command returns an error:

DROP ASSEMBLY ud

The error is:

DROP ASSEMBLY failed because 'ud' is referenced by object 'CreateWeightedValue'.

Oops. SQL Server is smart enough to know that objects in the database rely on the assembly. Alas, it is not smart enough to drop them. For that, we need code such as the following:

IF OBJECT_ID('Wavg') IS NOT NULL
DROP AGGREGATE Wavg
GO

IF OBJECT_ID('CreateWeightedValue') IS NOT NULL
DROP FUNCTION CreateWeightedValue
GO

IF (SELECT COUNT(*) FROM sys.types WHERE UPPER(name) = 'WEIGHTEDVALUE') > 0
DROP TYPE WeightedValue
GO

IF (SELECT COUNT(*) FROM sys.assemblies WHERE UPPER(name) = 'UD') > 0
DROP ASSEMBLY ud
GO

This code checks to see if each object exists before deleting the object. This prevents errors from occuring. Notice that the TYPE is dropped after the AGGREGATE and FUNCTION. This is because the TYPE is used by them, and cannot be dropped while they reference it. All this code is in blog1-load.sql.


Using Wavg()

As mentioned at the beginning of the post, the right way to use this function is as follows:


SELECT AVG(purban), ud.dbo.WAVG(wval)
FROM (SELECT zc.*,
ud.dbo.CreateWeightedValue(purban, population) as wval
FROM zipcensus zc) zc

The innermost subquery reads from the table zipcensus, returning all the values in the row, plus a new column containing the weighted value. This is passed to the outermost query and used as the argument to the WAVG function.

By the way, the function can be used on any numeric type (integer, float, numeric, and so on), because SQL Server will convert all numeric values to FLOAT (the type of the arguments to CreateWeightedValue()).

We can verify that weighted value works, using code such as:

SELECT wval.Value, wval.Weight, wval.ToString()
FROM (SELECT ud.dbo.CreateWeightedValue(3, 4.5) as wval) a

This returns the values placed into the function.

Although this example does not require the functionality, user defined types can be placed in tables and used as variables in T-SQL code.


2 Comments:

Blogger MrFrijole said...

How do you make this work with null values? Meaning how do you declare a null WeightedValue type? I need this functionality, and you're the only one that I've found has a solution.

December 16, 2008 2:38 PM  
Blogger Iggy said...

if I had to do a subquery for weighted average, wouldn't it be easier to just sum(weight*value)/sum(weight)?
I've been searching on the net for a way to simplify that query into wavg(weight,value). Alas, thanks for the post, it was informative.

October 20, 2009 4:15 PM  

Post a Comment

<< Home