Wednesday, September 26, 2007

Extending SQL Server to Support Some Statistical and Data Mining Functionality

My most recent book, Data Mining Using SQL and Excel, is about combining the power of databases and Excel for data analysis. From working on that book, I have come to feel that SQL and data mining are natural allies, since both are about making sense of large amounts of data.

A surprising observation (at least to me) is that SQL operations are analogous to data mining operations. In many ways, aggregating data -- summarizing it along dimensions -- is similar to building models, since both are about capturing underlying structure in the data. And, in some cases, joining tables is similar to scoring models, since joining takes information from one row and "adds in" new information.
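To make the analogy concrete, here is a minimal sketch using Python's built-in sqlite3 module (not the SQL Server and C# setup used later in this blog). The tables customers, region_model, and prospects, along with their columns and values, are made up purely for illustration. The GROUP BY plays the role of building a very simple model, and the JOIN plays the role of scoring new rows with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical training data: one row per customer with a 0/1 outcome.
cur.execute("CREATE TABLE customers (id INTEGER, region TEXT, churned INTEGER)")
cur.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "east", 1), (2, "east", 0), (3, "west", 0), (4, "west", 0)],
)

# "Building the model": summarize the outcome along one dimension.
# The resulting table captures structure in the data (a rate per region).
cur.execute(
    """
    CREATE TABLE region_model AS
    SELECT region, AVG(churned) AS churn_rate
    FROM customers
    GROUP BY region
    """
)

# Hypothetical new rows that have no outcome yet.
cur.execute("CREATE TABLE prospects (id INTEGER, region TEXT)")
cur.executemany("INSERT INTO prospects VALUES (?, ?)", [(10, "east"), (11, "west")])

# "Scoring": the join adds the model's information to each new row.
scores = cur.execute(
    """
    SELECT p.id, p.region, m.churn_rate
    FROM prospects p
    JOIN region_model m ON m.region = p.region
    ORDER BY p.id
    """
).fetchall()

print(scores)
```

Nothing here is specific to SQLite; the same two queries could be written against most SQL databases. The point is only that the "model" is an ordinary table produced by aggregation, and applying it is an ordinary join.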

This idea has intrigued me since finishing the final draft. So, I decided to embark on an adventure: extending SQL functionality to include various types of models. My goal is to make data mining functionality a natural part of using SQL. Okay, that is a bit ambitious, because any SQL extension tends to look "grafted" onto the basic language. However, it is possible to add the concept of a "statistical model" to SQL and see where that goes.

The purpose of this blog is to capture the interesting ideas that I learn and put them in one place. I have already learned a lot about SQL, statistics, C#, and .NET programming since starting this endeavor. I also want to make the code available to other people who might find it useful.

For various reasons that I discuss in my first technical post, I have decided to implement this scenario using .NET (that is, C# and Microsoft SQL Server). By the way, this is not because of a great love for Microsoft development environments; I have very painful memories of trying to use very buggy release versions of Microsoft Visual C++ in the late 1980s. I am learning this environment "as I go", since I had never programmed in C# before April of this year.

I already have some ideas for upcoming posts:
  • Introduction to .NET for Extending SQL Server
  • Adding a Useful Function: Weighted Averages
  • Two More Useful Functions: MinOF and MaxOF
  • What is a Marginal Value Model?
  • Implementing a Basic Marginal Value Model
  • What is a Linear Regression Model?
  • Implementing a Linear Regression Model
  • Model Management and the Marginal Value Model
  • What is a Naive Bayesian Model?
  • Implementing a Naive Bayesian Model
  • What is a Survival Model?
  • Implementing a Survival Model
I do not have a schedule in mind, but this is an adventure and I'm very curious where it will lead.

