<?xml version='1.0' encoding='UTF-8'?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/'><id>tag:blogger.com,1999:blog-8614680675185707253</id><updated>2007-12-06T19:28:52.010-05:00</updated><title type='text'>Data Mining in SQL Server</title><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default'/><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml'/><author><name>Gordon S. Linoff</name></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>11</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-4633672163840190972</id><published>2007-11-30T16:27:00.000-05:00</published><updated>2007-12-06T19:28:52.057-05:00</updated><title type='text'>Naive Bayesian Models (Part 1)</title><content type='html'>[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;. The first post is available &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/09/extending-sql-server-to-support-some.html"&gt;here&lt;/a&gt;.]&lt;br /&gt;&lt;br /&gt;&lt;script type="text/javascript"&gt;&lt;br /&gt;&lt;br /&gt;_uacct = "UA-380835-1";&lt;br /&gt;&lt;br /&gt;urchinTracker();&lt;br /&gt;&lt;br /&gt;&lt;/script&gt;&lt;br /&gt;The previous posts have shown how to extend SQL Server to support some basic modeling capabilities.  
This post and the next post add a new type of model, the naive Bayesian model, which is actually quite similar to the marginal value model discussed earlier.&lt;br /&gt;&lt;br /&gt;This post explains some of the mathematics behind the model.  A more thorough discussion is available in my book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;font-size:180%;color:#006600;"&gt;&lt;strong&gt;What Does A Naive Bayesian Model Do?&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;A naive Bayesian model calculates a probability by combining summary information along different dimensions.&lt;br /&gt;&lt;br /&gt;This is perhaps best illustrated by an example.  Say that we have a business where 55% of customers survive for the first year.  Say that male customers have a 60% probability of remaining a customer after one year and that California customers have an 80% probability.  What is the probability that a male customer from California will survive the first year?&lt;br /&gt;&lt;br /&gt;The first thing to note is that the question has no correct answer; perhaps men in California are quite different from men elsewhere.  The answer could be any number between 0% and 100%.&lt;br /&gt;&lt;br /&gt;The second thing to note is the structure of the problem.  We are looking for a probability for the intersection of two dimensions (gender and state).  To solve this, we have:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The overall probability for the population (55%).&lt;/li&gt;&lt;li&gt;The probability along each dimension (60% and 80%).&lt;/li&gt;&lt;/ul&gt;The naive Bayesian model combines this information by making an assumption (which may or may not be true).  In this case, the answer is that a male from California has an 83.1% probability of surviving the first year.&lt;br /&gt;&lt;br /&gt;The naive Bayesian model can handle any number of dimensions.  
However, it is always calculating a probability using information about the probabilities along each dimension individually.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;font-size:180%;color:#006600;"&gt;&lt;strong&gt;Probabilities and Likelihoods&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;The value of 83.1% may seem surprising.  Many people's intuition would put the number between 60% and 80%.  Another way of looking at the problem, though, might make this clearer.  Being male makes a customer more likely to stay for a year.  Being from California makes a customer even more likely to stay.  Combining the information on the two dimensions should be stronger than either dimension individually.&lt;br /&gt;&lt;br /&gt;It is one thing to explain this in words.  Modeling and data mining require explaining things with formulas.  The problem is about probabilities, but the solution uses a related concept, the likelihood.&lt;br /&gt;&lt;br /&gt;The likelihood has a simple formula:  &lt;span style="font-family:courier new;"&gt;likelihood = p / (1-p)&lt;/span&gt;,  where &lt;span style="font-family:courier new;"&gt;p&lt;/span&gt; is the probability.  That is, it is the ratio of the probability of something happening to the probability of its not happening.  Where the probability varies from 0% to 100%, the likelihood varies from zero to infinity.  Also, given a likelihood, the probability is easily calculated:  &lt;span style="font-family:courier new;"&gt;p = 1 - (1/(1+likelihood))&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The likelihood is also known as the &lt;em&gt;odds&lt;/em&gt;.  When we say something has 1 to 9 odds, we mean that something happens one time for every nine times it does not happen.  
Another way of saying this is that the probability is 10%.&lt;br /&gt;&lt;br /&gt;For instance, the following are the likelihoods for the simple problem being discussed:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;overall likelihood (p = 55%) = 1.22;&lt;/li&gt;&lt;li&gt;male likelihood (p = 60%) = 1.50; and,&lt;/li&gt;&lt;li&gt;California likelihood (p = 80%) = 4.00.&lt;/li&gt;&lt;/ul&gt;Notice that the likelihoods vary more dramatically than the probabilities.  That is, 80% is just a bit more than 60%, but 4.0 is much larger than 1.5.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="font-family:arial;font-size:180%;color:#006600;"&gt;The Naive Bayesian Formula&lt;/span&gt;&lt;/strong&gt;&lt;br /&gt;The formula for the naive Bayesian model uses one more concept, the &lt;em&gt;likelihood ratio&lt;/em&gt;.  This is the ratio of any given likelihood to the overall likelihood.  This ratio also varies from zero to infinity.  When the likelihood ratio is greater than one, then something is more likely to occur than on average for everyone (such as the case with both males and Californians).&lt;br /&gt;&lt;br /&gt;The formula for the naive Bayesian model says the following:  the likelihood of something occurring along multiple dimensions is the overall likelihood times the likelihood ratios along each dimension.&lt;br /&gt;&lt;br /&gt;For the example, the formula produces:  &lt;span style="font-family:courier new;"&gt;1.22*(1.5/1.22)*(4.0/1.22)=4.91&lt;/span&gt;.  When converted back to a probability this produces 83.1%.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;font-size:180%;color:#006600;"&gt;&lt;strong&gt;What Does the Naive Assumption Really Mean?&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;The "Bayesian" in "naive Bayesian" refers to a basic probability formula devised by Rev. Thomas Bayes in the 1700s.  
This probability formula is used to derive the formula described above.&lt;br /&gt;&lt;br /&gt;The "naive" in naive Bayesian refers to a simple assumption.  This is the assumption that the information along the two dimensions is independent.  This is the same assumption that we made for the marginal value model.  In fact, the two models are very similar.  Both combine information along dimensions into a single value.  For the marginal value model, the information is counts.  For the naive Bayesian model, it is probabilities.&lt;br /&gt;&lt;br /&gt;In the real world, it is unusual to find dimensions that are truly independent.  However, the naive Bayesian approach can still work well in practice.  Often, we do not need the actual probabilities.  It is sufficient to have relative measures (males from California are better risks than females from Nevada, for instance).&lt;br /&gt;&lt;br /&gt;If we further analyzed the data or did a test and learned that males from California really survived at only a 40% rate instead of 83.1%, then this fact would be evidence that state and gender are not independent.  The solution is simply to replace state and gender by a single category that combines the two:  California-male, California-female, Nevada-male, and so on.&lt;br /&gt;&lt;br /&gt;One of the nice features of these models is that they can use a large number of features of the data and readily handle missing information (the likelihood value for a dimension that is missing is simply not included in the equation).  This makes them feasible for some applications, such as classifying text, where other techniques do not work so well.  
It also makes it possible to calculate a probability for a combination of dimensions which has never been seen before -- made possible by the naive assumption.&lt;br /&gt;&lt;br /&gt;The next posting contains the code for a basic naive Bayesian model in SQL Server.</content><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/2007/11/naive-bayesian-models-part-1.html' title='Naive Bayesian Models (Part 1)'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8614680675185707253&amp;postID=4633672163840190972' title='0 Comments'/><link rel='replies' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml' title='Post Comments'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/4633672163840190972'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/4633672163840190972'/><author><name>Gordon S. Linoff</name></author></entry><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-1715653351456959565</id><published>2007-11-24T11:12:00.001-05:00</published><updated>2007-11-25T21:49:10.236-05:00</updated><title type='text'>Managing SQL Server Extensions (Functions, Types, Etc.)</title><content type='html'>&lt;script src="http://www.google-analytics.com/urchin.js" type="text/javascript"&gt;&lt;/script&gt;[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;. The first post is available &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/09/extending-sql-server-to-support-some.html"&gt;here&lt;/a&gt;.]&lt;br /&gt;&lt;br /&gt;Up to now, I have discussed adding functions, types, and aggregates into SQL Server.  
The code is created in C# and loaded into SQL Server as an assembly. Placing the code into SQL Server has four steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Deleting all objects already defined in the assembly (if any).&lt;/li&gt;&lt;li&gt;Deleting the assembly (if present).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Loading a new version of the assembly.&lt;/li&gt;&lt;li&gt;Redefining the objects in the assembly.&lt;/li&gt;&lt;/ol&gt;For readers who are familiar with the process of compiling and linking code, this process is similar to linking.  The references in the assembly have to be "linked" into the database, so the database knows what the references refer to.&lt;br /&gt;&lt;br /&gt;I am doing this process manually for two reasons.  First, because this is how I originally set up this project for adding data mining functionality into SQL Server (even though Visual Studio does have options for doing this automatically).  Second, this approach provides an opportunity to start to understand how SQL Server manages user defined types.&lt;br /&gt;&lt;br /&gt;This post discusses how to manage the first of these steps automatically.  That is, it describes how to delete all objects in a database referenced by a particular assembly.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;A Common Error&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The following code drops the user defined aggregate &lt;span style="font-family:courier new;"&gt;CreateBasicMarginalValueModel()&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;DROP AGGREGATE CreateBasicMarginalValueModel&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This statement is quite simple.  
However, if it is executed twice, it returns the error:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Msg 3701, Level 11, State 5, Line 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Cannot drop the aggregate function 'createbasicmarginalvaluemodel', because it does not exist or you do not have permission.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is inelegant, because it means that we cannot run the same code to drop an aggregate twice in a row.  Even if it works the first time, the second time it runs, the same code will return an error.  Furthermore, when we see this error, we do not know if the problem is the non-existence of the object or inadequate database permissions.&lt;br /&gt;&lt;br /&gt;To fix this, we use the T-SQL &lt;span style="font-family:courier new;"&gt;IF&lt;/span&gt; construct:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;IF OBJECT_ID('CreateBasicMarginalValueModel') IS NOT NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;....&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;DROP AGGREGATE CreateBasicMarginalValueModel&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This simply says that if the object exists, then drop the aggregate.  However, this is still inelegant, because it mentions the name of the aggregate function twice, once in the "if" clause and once when dropping it.  In addition, we do not want to have to explicitly mention every object by name, since we may not know which objects in the assembly were actually declared in the database.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Handling Dependencies&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Another problem occurs when we try to drop a type.  
The following statement:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;IF (SELECT COUNT(*) FROM sys.types&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;WHERE UPPER(name) = 'BASICMARGINALVALUEMODEL') &gt; 0 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;DROP TYPE BasicMarginalValueModel&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GO&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;returns the enigmatic error:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Msg 3732, Level 16, State 1, Line 3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Cannot drop type 'BasicMarginalValueModel' because it is currently in use.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This error does not, unfortunately, tell us who or what is using the type.  In this case, it is the set of functions, procedures, aggregates, and other types that use the type as an argument or return value.  We have to remove all these objects before we can remove the type.&lt;br /&gt;&lt;br /&gt;In general, we need to remove functions, aggregates, and procedures before we remove types.  This ensures that nothing still depends on the types, so they can be removed cleanly from the database.&lt;br /&gt;&lt;br /&gt;This problem with dependencies is actually an advantage.  It ensures that code loaded into the database all refers to the proper set of definitions.  If a function uses a type, we cannot simply replace the type.  We need to drop the function, drop the type, and then re-declare the type and function.  
This ensures that the function refers to the proper code when using the type.&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Finding All User-Defined Functions in an Assembly&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The first step to removing a certain class of objects, say functions, is to find all of them in the database.  They are conveniently located in the &lt;span style="font-family:courier new;"&gt;sys.objects&lt;/span&gt; table, so the following query returns all user-defined functions that are defined in assemblies:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT o.name&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM sys.objects o &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;WHERE o.type in ('FS', 'FT')&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family:courier new;"&gt;sys.objects&lt;/span&gt; table contains references to many different objects in the database (almost everything except user defined types).  The specific type abbreviations 'FS' and 'FT' refer to scalar functions and table functions, respectively.&lt;br /&gt;&lt;br /&gt;The only problem with this code fragment is that it returns &lt;span style="font-style: italic;"&gt;all&lt;/span&gt; such user defined functions, and there might be user defined functions from different assemblies.  What we really want are only the user defined functions in the "ud" assembly.  To find these, we have to use two more reference tables.  
To get all the functions in "ud", the query looks like:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT o.name&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM sys.objects o JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;sys.assembly_modules am&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;ON o.object_id = am.object_id JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;sys.assemblies a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;ON am.assembly_id = a.assembly_id&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;WHERE a.name = 'ud' and &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;o.type in ('FS', 'FT')&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This finds all user defined functions only in the desired assembly.  The code for procedures and aggregates is quite similar.  The only difference is that the type in the &lt;span style="font-family:courier new;"&gt;WHERE&lt;/span&gt; clause matches 'PC' and 'AF', respectively.&lt;br /&gt;&lt;br /&gt;User defined types are somewhat different.  They are stored in the table &lt;span style="font-family:courier new;"&gt;sys.types&lt;/span&gt;, rather than &lt;span style="font-family:courier new;"&gt;sys.objects&lt;/span&gt;.  
The query to find all of them is similar, requiring looking up assembly information in additional tables:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT t.name&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM sys.types t JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;sys.type_assembly_usages tau&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;ON t.user_type_id = tau.user_type_id JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;sys.assemblies a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;ON tau.assembly_id = a.assembly_id&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;WHERE a.name = 'ud' &lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Although the query is somewhat different, it returns the name of the user defined types in the given assembly.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Deleting All User Defined Functions&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Going from a query that returns a list of user-defined functions (or whatever) to actions on those functions (such as dropping them) requires using the T-SQL command language.  
In particular, we need to define cursors on the query, so we can do something to each row.&lt;br /&gt;&lt;br /&gt;Code that uses cursors has the following structure:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;DECLARE @name VARCHAR(2000)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;DECLARE the_cursor CURSOR FOR&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;span style="color: rgb(0, 0, 0); font-style: italic;"&gt;QUERY&lt;/span&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;query&gt;&lt;/query&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;query&gt;&lt;/query&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;OPEN the_cursor&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;FETCH next FROM the_cursor INTO @name&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;WHILE @@fetch_status = 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;BEGIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;span style="color: rgb(0, 0, 0); font-style: italic;"&gt;DO ACTION HERE&lt;/span&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;do&gt;&lt;/do&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;action&gt;&lt;/action&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier 
new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FETCH NEXT FROM the_cursor INTO @name&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;CLOSE the_cursor&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;DEALLOCATE the_cursor&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;The first two lines of the code declare two variables.  The first is a standard scalar variable, which is used to store each value returned by the query.  The second is a cursor, which is used to cycle through the rows.  Notice that the cursor variable is not preceded by an at sign.&lt;br /&gt;&lt;br /&gt;Most of the remaining code is the framework used to manage the cursor.  It is important to handle cursors correctly.  A simple mistake -- such as leaving out the &lt;span style="font-family:courier new;"&gt;FETCH NEXT FROM&lt;/span&gt; -- can result in an infinite loop.  We do not want that to happen.&lt;br /&gt;&lt;br /&gt;Opening the cursor runs the query and the &lt;span style="font-family:courier new;"&gt;FETCH NEXT&lt;/span&gt; statement gets the next value, which is placed in the local variable &lt;span style="font-family:courier new;"&gt;@name&lt;/span&gt;.  
When there are no more values, the cursor is closed and deallocated.&lt;br /&gt;&lt;br /&gt;The full code for dropping all functions is a bit longer, because the query and action portions are filled in:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;DECLARE @function_name VARCHAR(2000)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;DECLARE function_cursor CURSOR FOR&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT o.name&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;FROM sys.objects o JOIN&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;sys.assembly_modules am&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;ON o.object_id = am.object_id JOIN&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span 
style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;sys.assemblies a&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;ON am.assembly_id = a.assembly_id&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;WHERE a.name = 'ud' AND&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;o.type in ('FS', 'FT')&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;OPEN function_cursor&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;    &lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;FETCH NEXT FROM function_cursor INTO @function_name&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;    &lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 
255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;WHILE @@fetch_status = 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;    &lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;BEGIN&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;EXEC('DROP FUNCTION '+@function_name)&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;FETCH NEXT FROM function_cursor INTO @function_name&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;CLOSE function_cursor&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;DEALLOCATE function_cursor&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;PRINT 'DROPPED FUNCTIONS'&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;The cursor is defined over the query that returns all the functions in the 
"ud" assembly.  For each of these functions, the action is to drop the function.  The action uses the &lt;span style="font-family:courier new;"&gt;EXEC()&lt;/span&gt; function rather than just the &lt;span style="font-family:courier new;"&gt;DROP FUNCTION&lt;/span&gt; statement.  The &lt;span style="font-family:courier new;"&gt;EXEC()&lt;/span&gt; function takes a string as an argument, and executes the string as a T-SQL statement.  This makes it possible to incorporate the name of the function into the command.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 102, 0);font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;The Full Code&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The code for dropping aggregates, procedures, and types follows the same structure as the code for dropping functions.  The only differences are in the query that defines the cursor and the string passed to the &lt;span style="font-family:courier new;"&gt;EXEC()&lt;/span&gt; function (&lt;span style="font-family:courier new;"&gt;DROP AGGREGATE&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;DROP PROCEDURE&lt;/span&gt;, or &lt;span style="font-family:courier new;"&gt;DROP TYPE&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;The one important point about the code is that types need to be dropped last, because of the dependency problem.&lt;br /&gt;&lt;br /&gt;This entry does not include the T-SQL code for this example.   The next entry discusses naive Bayesian models.  
The entry after that will include code that has these enhancements.</content><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/2007/11/managing-sql-server-extensions.html' title='Managing SQL Server Extensions (Functions, Types, Etc.)'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8614680675185707253&amp;postID=1715653351456959565' title='0 Comments'/><link rel='replies' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml' title='Post Comments'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/1715653351456959565'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/1715653351456959565'/><author><name>Gordon S. Linoff</name></author></entry><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-7856584213528356087</id><published>2007-11-10T14:57:00.000-05:00</published><updated>2007-11-10T19:19:31.219-05:00</updated><title type='text'>Marginal Value Models: C# Table Valued Functions (Part 3)</title><content type='html'>[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;. The first post is available &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/09/extending-sql-server-to-support-some.html"&gt;here&lt;/a&gt;.]&lt;br /&gt;&lt;br /&gt;The previous two posts introduce marginal value models.  
Underlying these models is a table of values.  This post discusses how this table can be returned in SQL Server.  In other words, this post discusses table valued functions.&lt;br /&gt;&lt;br /&gt;For reference, the files associated with the model are available at:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3enum.cs"&gt;blog3enum.cs&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3enum.dll"&gt;blog3enum.dll&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3enum-load.sql"&gt;blog3enum-load.sql&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;The first file contains the source code for the functionality. The last two files contain the DLL and SQL code for loading the functionality into SQL Server.  These files are slightly different from the previous blog3 files, since I fixed some errors in them.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;What Are Table Valued Functions?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;In earlier posts, I introduced user defined functions in SQL.  These functions have all been scalar functions, whether implemented as user defined functions or as methods in a user defined type.  SQL Server also supports user defined table valued functions.  
The purpose here is to return all the values in a &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The following T-SQL code shows an example for this:&lt;br /&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;DECLARE @mod dbo.BasicMarginalValueModel &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;SELECT @mod = ud.dbo.CreateBasicMarginalValueModel(arg)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT TOP 100&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;ud.dbo.MarginalValueArgs2(zc.population,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...............    .&lt;/span&gt;zc.hhmedincome, 1) as arg&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM sqlbook..zipcensus zc) zc&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;SELECT m.mvme.ToString()&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM ud.dbo.MarginalValues(@mod) m&lt;/span&gt;&lt;/p&gt;The first statement declares a variable called &lt;span style="font-family:courier new;"&gt;@mod&lt;/span&gt; as a &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt;.   The second assigns this variable a value, using the first 100 rows of the table zipcensus (provided on the companion page to the book &lt;span style="font-style: italic; font-weight: bold;"&gt;Data Analysis Using SQL and Excel&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;The third statement calls the table valued function &lt;span style="font-family:courier new;"&gt;MarginalValues()&lt;/span&gt;.  This function returns the values stored in each cell of the model.  
So, if there are two dimensions and each has ten values, then this returns twenty rows.  Of course, because there is more than one value in the table (a string and a value), a new data type is needed to store these values.  This data type is called &lt;span style="font-family:courier new;"&gt;MarginalValueModelElement&lt;/span&gt;.  The attached files contain the definitions for these functions and types.&lt;br /&gt;&lt;br /&gt;A second table valued function is also defined for the type.  This function is called &lt;span style="font-family:courier new;"&gt;AllCells() &lt;/span&gt;and it returns all combinations of the cells.  So, if there are ten values along two dimensions, this function returns one hundred rows, one for each combination of the two values.  This function also shows that it is possible to have more than one table valued function within a given model.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Defining Table Valued Functions in T-SQL&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Table valued functions have to be declared in T-SQL.  The definition is an extension of the definition of scalar valued functions.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family:courier new;"&gt;MarginalValues()&lt;/span&gt; function returns a specific type, so this needs to be declared.  
This is simply:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;CREATE TYPE MarginalValueModelElement&lt;br /&gt;EXTERNAL NAME ud.MarginalValueModelElement&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The function itself uses the code:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;CREATE FUNCTION MarginalValues(@arg BasicMarginalValueModel)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;RETURNS TABLE (mvme MarginalValueModelElement)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;AS EXTERNAL NAME ud.BasicMarginalValueModel.InitMarginalValueEnumerator &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;First, notice that table valued functions use the same keyword as scalar functions.  The difference is the use of &lt;span style="font-family:courier new;"&gt;RETURNS TABLE&lt;/span&gt; rather than just &lt;span style="font-family:courier new;"&gt;RETURNS&lt;/span&gt;.  After this keyword comes the table definition.  The table in this example has only one column.  This is not a limitation of SQL Server or of C# (table valued functions are implemented as enumerators in C#): a table valued function can return several columns, with the fill-row function (discussed below) taking one output parameter for each column.&lt;br /&gt;&lt;br /&gt;Second, notice that the table valued function is actually defined within the type &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt;.  Scalar functions defined within a type do not need explicit declarations; however, table functions do.  Although the function is defined within the type, it is defined as static, so it still needs to take the model as an argument.  
In fact, all user defined functions declared explicitly in SQL Server must be static, both scalar and table functions.&lt;br /&gt;&lt;br /&gt;Notice that the function definition defines the name of the column as &lt;span style="font-family:courier new;"&gt;mvme&lt;/span&gt;.  In the previous code, this column name is used to access values.&lt;br /&gt;&lt;br /&gt;Within SQL Server, scalar functions and table valued functions are stored separately.  After loading blog3enum.dll using blog3enum-load.sql (two files mentioned at the top of this post), the following are in SQL Server:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/dataminingsqlserver/uploaded_images/blog3-management-studio-718327.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://www.data-miners.com/dataminingsqlserver/uploaded_images/blog3-management-studio-718324.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;(I apologize for the small size of this image; I do not know how to make it larger.)&lt;br /&gt;Notice that SQL Server has separate areas for scalar functions and table-valued functions.  I find this ironic, since the metadata stores them in the same way.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;The Primitives for Implementing Them in C#&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The C# code for table valued functions is basically the code for user defined enumerators.  
A user defined enumerator is something that you use for the &lt;span style="font-family:courier new;"&gt;foreach&lt;/span&gt; statement.&lt;br /&gt;&lt;br /&gt;There are three steps for creating a user-defined enumerator in C#:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Declare the class to be an instance of &lt;span style="font-family:courier new;"&gt;System.Collections.IEnumerable&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;Declare the two enumeration functions.&lt;/li&gt;&lt;li&gt;Declare the enumeration class that does all the work.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The next three sections discuss these in a bit more detail.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;&lt;span style="font-family:courier new;"&gt;IEnumerable&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;IEnumerator&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Declaring a table valued function requires declaring a user defined enumeration, and this in turn requires using two underlying classes.  The distinction between these two classes is a bit subtle and confusing, although the ideas are not really difficult.&lt;br /&gt;&lt;br /&gt;The first class is the &lt;span style="font-family:courier new;"&gt;IEnumerable&lt;/span&gt; class.  This class says "hey, I'm a class that supports &lt;span style="font-family:courier new;"&gt;foreach&lt;/span&gt;".   We need it, because such classes are actually what table-valued functions are.  And this makes sense.  A table valued function has a bunch of rows that are returned one-by-one.  The &lt;span style="font-family:courier new;"&gt;foreach&lt;/span&gt; clause does the same thing in C#.&lt;br /&gt;&lt;br /&gt;The second class is &lt;span style="font-family:courier new;"&gt;IEnumerator&lt;/span&gt;, which we will see used below.  This class is not a declaration of an external interface.  
Instead, it is used in the bowels of the &lt;span style="font-family:courier new;"&gt;foreach&lt;/span&gt;.  It maintains the state needed to fetch the next value.&lt;br /&gt;&lt;br /&gt;I would like to add one more comment about table valued functions.  Unlike aggregation functions, they do not seem to support a parallel interface.  This is unfortunate, since this limits the scalability of code using them.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Declaring SQL Table Functions&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Two functions are needed to define a table valued function.  The first is the enumeration function and the second is a helper function that "fills" a row.  These two functions are defined as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;[SqlFunction(FillRowMethodName = "BVMMEnumeratorFillRow")]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;public static BVMMElementEnumerator&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;InitMarginalValueEnumerator (BVMM csm)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;{&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;return new BVMMElementEnumerator(csm);&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;public static void&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;BVMMEnumeratorFillRow (Object row,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;out MarginalValueModelElement evme)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;{&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 
255);"&gt;....&lt;/span&gt;evme = (MarginalValueModelElement)row;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}  // BVMMEnumeratorFillRow()&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(In this code, I have used &lt;span style="font-family:courier new;"&gt;BVMM&lt;/span&gt; for &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt; so the code formats more easily.)&lt;br /&gt;&lt;br /&gt;The first of these functions is the reference used in the &lt;span style="font-family:courier new;"&gt;CREATE FUNCTION&lt;/span&gt; statement.  This uses a compiler directive, specific for the SQL Server interface.  This directive simply says that the function to call to retrieve each row is called &lt;span style="font-family:courier new;"&gt;BVMMEnumeratorFillRow&lt;/span&gt;.  Not surprisingly, this is the other function.&lt;br /&gt;&lt;br /&gt;The first function returns the enumerator.  This is a special class that stores state between calls to the enumerator.  This is discussed in the next section.&lt;br /&gt;&lt;br /&gt;The underlying C# routines that do the enumerations use very general code that works in terms of objects and that has nothing to do with SQL Server.  The interface to SQL Server uses the fill-row routine, which simply copies the appropriate values into the row, and this is handled by casting the object to the appropriate type.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Defining the Enumeration Class&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The enumeration class is the most complex part of the definition.  
However, in this case, the code is rather simple, because it accesses an underlying enumerator used for the &lt;span style="font-family:courier new;"&gt;Dictionary&lt;/span&gt; class.&lt;br /&gt;&lt;br /&gt;First, a word about the class used for the &lt;span style="font-family: courier new;"&gt;MarginalValues()&lt;/span&gt; SQL function.  It is called &lt;span style="font-family: courier new;"&gt;BasicMarginalValueModelElementEnumerator&lt;/span&gt;.  The connection between the function in SQL and this class is not readily apparent.  It requires looking at the C# code that defines the C# function used to define &lt;span style="font-family: courier new;"&gt;MarginalValues()&lt;/span&gt;.  This function is called &lt;span style="font-family: courier new;"&gt;InitMarginalValueEnumerator()&lt;/span&gt;  and it creates an instance of this enumeration class.&lt;br /&gt;&lt;br /&gt;So, the class must be defined to inherit from &lt;span style="font-family:courier new;"&gt;System.Collections.IEnumerator&lt;/span&gt;; this sets it up to have the appropriate interface for an enumeration.&lt;br /&gt;&lt;br /&gt;This class contains the following elements:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A private member to store the state.  This is an instance of the class &lt;span style="font-family:courier new;"&gt;System.Collections.IEnumerator&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;A constructor, which assigns the enumerator from the dictionary to the private member.&lt;/li&gt;&lt;li&gt;A &lt;span style="font-family:courier new;"&gt;MoveNext()&lt;/span&gt; function that goes to the next element in the list.  This simply calls the dictionary function.&lt;/li&gt;&lt;li&gt;A &lt;span style="font-family:courier new;"&gt;Reset()&lt;/span&gt; function that starts over at the beginning.&lt;/li&gt;&lt;li&gt;A &lt;span style="font-family:courier new;"&gt;Current&lt;/span&gt; member that returns the current value of the enumerator as an object.  
It is this object that is then copied into the row, using the fill function.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;All of these are defined in terms of the enumeration for the &lt;span style="font-family:courier new;"&gt;Dictionary&lt;/span&gt; class, so the code itself is quite simple.  Note that everything in this class is set up only for the enumeration and not for SQL code.  The class has no SQL-specific compiler directive, or functions like &lt;span style="font-family: courier new;"&gt;Write()&lt;/span&gt; and&lt;span style="font-family: courier new;"&gt; Read()&lt;/span&gt;.  It is the fill-row function that takes the value returned by the enumerator and transfers the value into the SQL Server world.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family:courier new;"&gt;AllCells&lt;/span&gt; enumeration function provides a more complicated example.  In this case, the calculations are done explicitly, because there is no underlying type to support the functionality.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Table Valued Functions and Modeling&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Table valued functions are a very powerful feature of SQL Server.  However, they are ancillary to my goal, which is to understand how to extend the SQL language to support data mining concepts such as modeling.&lt;br /&gt;&lt;br /&gt;They do have one very large shortcoming, which is the fact that their interface does not support parallel scalability.  This is significant, because my choice of SQL is partly due to its scalability.  Remember that the user defined aggregation functions include a &lt;span style="font-family:courier new;"&gt;Merge()&lt;/span&gt; method which does support parallelism.  
There is no corresponding capability for table valued functions.&lt;br /&gt;&lt;br /&gt;The preceding three posts have been a detailed exposition on how to incorporate one type of model into SQL Server.  The first explained the model, the second explained the C# code, and this, the third, explains table valued functions.&lt;br /&gt;&lt;br /&gt;Much of this has been preparatory.  The basic marginal value model is more useful as an example than as a modeling tool.  The next post is about making the T-SQL load script a bit simpler.  It will then be followed by the description of another type of model.  Naive Bayesian models are quite powerful and useful, and actually quite similar to marginal value models.&lt;br /&gt;</content><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/2007/11/marginal-value-models-c-table-valued.html' title='Marginal Value Models: C# Table Valued Functions (Part 3)'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8614680675185707253&amp;postID=7856584213528356087' title='0 Comments'/><link rel='replies' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml' title='Post Comments'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/7856584213528356087'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/7856584213528356087'/><author><name>Gordon S. 
Linoff</name></author></entry><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-9126588510212112853</id><published>2007-11-02T18:15:00.000-04:00</published><updated>2007-11-03T18:43:14.274-04:00</updated><title type='text'>Marginal Value Models: Overview of C# Code (Part 2)</title><content type='html'>&lt;div&gt;[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;. The first post is available &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/09/extending-sql-server-to-support-some.html"&gt;here&lt;/a&gt;.]&lt;br /&gt;&lt;br /&gt;Marginal value models are a very simple type of model that calculates expected values along dimensions. The &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/10/marginal-value-models-building-and.html"&gt;previous posting&lt;/a&gt; explains them in more detail.&lt;br /&gt;&lt;br /&gt;This posting discusses C# coding issues in implementing the models. The next post discusses one particular aspect, which is the ability to return the marginal values created by the model.&lt;br /&gt;&lt;br /&gt;For reference, the files associated with the model are available at:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3-load.sql"&gt;blog3-load.sql&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3.dll"&gt;blog3.dll&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3.cs"&gt;blog3.cs&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;The first two files contain the DLL and SQL code for loading functionality into SQL Server. 
The third file contains the source code for the functionality.&lt;br /&gt;&lt;p&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Overview of Model and Classes&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;The marginal value model does a very simple calculation. The model remembers the counts for all values along all dimensions.  The goal is to calculate an expected value for a combination of dimensions, which involves the following steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Divide the count for each value by the total count. This gets a p-value for each value.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Multiply all the p-values together.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Multiply the result by the total count.&lt;/li&gt;&lt;/ol&gt;The result is the expected value. The rest of this post discusses the implementation in C#, starting with the model itself, then the code to create it.&lt;br /&gt;&lt;p&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Defining &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt;&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;The model is stored as a class. The following declaration defines a class for a model:&lt;br /&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;[Serializable]&lt;br /&gt;[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.UserDefined,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;MaxByteSize = 8000)]&lt;br /&gt;public class BasicMarginalValueModel :&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;INullable, IBinarySerialize, System.Collections.IEnumerable&lt;/span&gt;&lt;/p&gt;This definition includes several compiler directives needed for the interface to SQL Server.  The first, &lt;span style="font-family:courier new;"&gt;Serializable&lt;/span&gt;, means that the data in the model can be written to and read from, essentially, a file. 
In English, this implies that the methods &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; are defined.&lt;br /&gt;&lt;p&gt;The next directive specifies information about the type for the compiler. The maximum size of the type is 8,000 bytes. This is a SQL Server limit, alas. Also, remember that it applies to the &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; version of the model, not to the actual size in memory. The compiler option &lt;span style="font-family:courier new;"&gt;Format.UserDefined&lt;/span&gt; says that we are using useful types, so we need to write our own &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; routines. SQL Server can handle just a few types automatically; however, writing code using only unsigned integer values is a great limitation.&lt;/p&gt;&lt;p&gt;As a comment on this approach: it turns out that much of what we are doing -- putting values in and out of memory, defining NULL and so on -- is the type of work done by compilers. Fortunately, much of this work is rather mindless and easy.  So after doing it once, it is easy to do it again for the next type.&lt;br /&gt;&lt;/p&gt;The class itself inherits from three different classes; the first two are described in this entry. The third is described in the next one because it introduces a special type of functionality. The first in the list is the &lt;span style="font-family:courier new;"&gt;INullable&lt;/span&gt; class which enables the value to be NULL. 
In practice, this means that the following code fragment is in the class:&lt;br /&gt;&lt;code&gt;&lt;/code&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;public bool isNull;&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;public bool IsNull&lt;br /&gt;{&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;get&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;{&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;return isNull;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;}&lt;br /&gt;} // IsNull&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;public static BasicMarginalValueModel Null&lt;br /&gt;{&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;get&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;{&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;BasicMarginalValueModel bmvm =&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;new BasicMarginalValueModel();&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;return bmvm;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;}&lt;br /&gt;} // Null&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;This code defines the &lt;span style="font-family:courier new;"&gt;NULL&lt;/span&gt; value for the class (this is something that has the type of the class and the value of NULL) and the &lt;span style="font-family:courier new;"&gt;IsNull&lt;/span&gt; property, required by the &lt;span style="font-family:courier new;"&gt;INullable&lt;/span&gt; class.  There is little reason to vary this code.  
Personally, I think the &lt;span style="font-family:courier new;"&gt;INullable&lt;/span&gt; class could just implement it.  I suppose the flexibility is there, though, so the boolean variable &lt;span style="font-family:courier new;"&gt;isNull&lt;/span&gt; does not have to be a member of the class.&lt;br /&gt;&lt;p&gt;The &lt;span style="font-family:courier new;"&gt;IBinarySerialize&lt;/span&gt; parent class requires the &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; functions.&lt;/p&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Members of BasicMarginalValueModel&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;p&gt;In order to be used, the model must contain the count for each value along each dimension. This table is, in fact, all that is needed for the model. The dimension values are assumed to be strings; the value being stored is the p-value, which is a double. In C#, the appropriate data structure is a dictionary, a built-in data structure which in common parlance is better known as a hash table.  Perhaps the biggest strength of C# is the wealth of its built in container classes, so use them liberally.&lt;br /&gt;&lt;/p&gt;The first step in using a dictionary is to tell C# where the definition is by including the following line at the top of the file:&lt;br /&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;using System.Collections.Generic;&lt;/span&gt;&lt;/p&gt;The "using" clause is similar to an "#include" in the sense that both bring in outside definitions. However, "using" provides much more detail to the compiler, including compiler directives and definitions.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;The dictionary class is generic. We have to tell it the types that it is storing. 
The following code describes the dictionary as we want to use it:&lt;br /&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;public System.Collections.Generic.Dictionary&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&amp;lt;string, double&amp;gt; marginals;&lt;/span&gt;&lt;/p&gt;This syntax says to use the generic dictionary definition, where the key is a string (this is the thing being looked up) and the value is a double (this is the p-value), to define the class variable &lt;span style="font-family:courier new;"&gt;marginals&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The dictionary uses a trick to store all values along all dimensions.  A potential problem is that a given value might be valid for different dimensions. Instead of the key simply being the value, it is a composite key consisting of the dimension number (starting from zero) followed by a colon and then the value.  So, the value "upper" for the first dimension would be stored in the key "0:upper".  One additional entry in the dictionary is also defined.  The value "total:" represents the total count along all dimensions.&lt;br /&gt;&lt;br /&gt;By the way, the creation of the key and the parsing of the dimension and value from the key should probably be separate private functions in the class.  However, this code does not implement them this way.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;The only additional members of the class are &lt;span style="font-family:courier new;"&gt;isNull &lt;/span&gt;and &lt;span style="font-family:courier new;"&gt;numdimensions&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0);"&gt;Notes on Methods in BasicMarginalValueModel&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;In addition to the standard methods of the class, there are several additional methods.  
Most involve the functions needed to return the values in the model, a topic discussed in the next post.  However, two of them, &lt;span style="font-family:courier new;"&gt;Score()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;ChiSquared()&lt;/span&gt;, are functions that are intended to be accessed from SQL.  The advantage of putting these in the model class is that they can be directly accessed from SQL without having to define them using &lt;span style="font-family:courier new;"&gt;CREATE FUNCTION&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Both these functions call an internal function &lt;span style="font-family:courier new;"&gt;_Score()&lt;/span&gt; to do the calculation.  Unfortunately, the combination of C# and SQL Server does not do a good job with function overloading, so this function is simply given a different name.  That is, if &lt;span style="font-family:courier new;"&gt;Score()&lt;/span&gt; (or any other function) is overloaded, then it generates an error in SQL Server.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;The &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; functions have obvious definitions with two small caveats.  First, the number of items in the dictionary is written out.  Then the number of dimensions is written, followed by each dictionary entry.  The number of items is needed so &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; knows when it is finished.  The value is used in the loop.&lt;br /&gt;&lt;br /&gt;In addition, there is the danger that "total:" will be defined twice during the &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt;, once given the value of zero when an instance of the class is created and once when the dictionary entries are read.  To prevent this, the key is removed from the dictionary before the entries are read.  This step is not strictly necessary, because it happens not to be there.  
However, it is a good reminder.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:180%;"  &gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;"&gt;Implementation of &lt;span style="font-family:courier new;"&gt;CreateBasicMarginalValueModel&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Creating an instance of a marginal value model requires an aggregation.  Such aggregations make use of the following compiler directives and parent classes:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;[Serializable]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;[SqlUserDefinedAggregate(Format.UserDefined,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;IsInvariantToNulls = true,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;IsInvariantToDuplicates = false,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;IsInvariantToOrder = false,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;MaxByteSize = 8000)]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;public class CreateBasicMarginalValueModel : IBinarySerialize&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;The compiler directives specify that this is a serializable class with &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; methods.  The second specifies various features of the aggregation.  
For instance, &lt;span style="font-family:courier new;"&gt;IsInvariantToNulls&lt;/span&gt; means that adding in a NULL value does not change the aggregation (think of the difference between &lt;span style="font-family:courier new;"&gt;COUNT(*)&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;COUNT(&amp;lt;column&amp;gt;)&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;font-family:arial;" &gt;Members and Methods of CreateBasicMarginalValueModel&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The class itself contains one member, an instance of &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt;.  This is updated in the &lt;span style="font-family:courier new;"&gt;Accumulate() &lt;/span&gt;and &lt;span style="font-family:courier new;"&gt;Merge()&lt;/span&gt; methods.  &lt;span style="font-family:courier new;"&gt;Accumulate() &lt;/span&gt;updates a value for a dimension, either by adding it to the dictionary (if it does not exist) or incrementing the value stored in the dictionary.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Merge()&lt;/span&gt; actually does the same thing, just with two different dictionaries.  Recall that&lt;span style="font-family:courier new;"&gt; Merge()&lt;/span&gt; is used to support parallelism.  Two different processors might aggregate different chunks of data, which are then combined using this function.&lt;br /&gt;&lt;br /&gt;Because the aggregation value needs to be passed between SQL Server and C#, the serialization routines need to be defined.  
However, these are trivial, because they call the corresponding routines for the one member (which are the routines defined for &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 102, 0);font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;About &lt;span style="font-family:courier new;"&gt;MarginalValueArgs&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The argument to &lt;span style="font-family:courier new;"&gt;CreateBasicMarginalValueModel&lt;/span&gt; requires both a value and an associated count (because aggregation functions only take one argument, the value and count need to be combined into a single type).  This definition is very similar to &lt;span style="font-family:courier new;"&gt;WeightedValue&lt;/span&gt; described in an &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/10/weighted-average-continued-c-code.html"&gt;earlier&lt;/a&gt; posting.&lt;br /&gt;&lt;br /&gt;There is a creation function associated with &lt;span style="font-family:courier new;"&gt;MarginalValueArgs&lt;/span&gt;.  This is standard whenever adding a type.  An associated function is needed to create an instance of the type.&lt;br /&gt;&lt;br /&gt;The next posting describes one additional feature of the basic marginal value model.  This feature is the ability to list all the values in the model, and it introduces the idea of a table-valued function.  
Such a function is yet another useful extension of SQL Server.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;/div&gt;</content><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/2007/11/marginal-value-models-overview-of-c.html' title='Marginal Value Models: Overview of C# Code (Part 2)'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8614680675185707253&amp;postID=9126588510212112853' title='0 Comments'/><link rel='replies' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml' title='Post Comments'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/9126588510212112853'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/9126588510212112853'/><author><name>Gordon S. Linoff</name></author></entry><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-3462341556652276238</id><published>2007-10-25T16:26:00.000-04:00</published><updated>2007-10-27T00:23:08.961-04:00</updated><title type='text'>Marginal Value Models:  Building and Using Data Mining Models in SQL Server (Part 1)</title><content type='html'>&lt;div&gt;[This post is part of a series where I'm exploring how to add data mining functionality to the SQL language; this is an extension of my most recent book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;&lt;span style="font-style: italic;"&gt;Data Analysis Using SQL and Excel&lt;/span&gt;&lt;/a&gt;.  The first post is available &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/09/extending-sql-server-to-support-some.html"&gt;here&lt;/a&gt;.]&lt;br /&gt;&lt;br /&gt;A marginal value model is a very simple type of model. However, it gives a good example of how to implement a data mining model using SQL Server extensions. 
Recall that the model itself produces an expected value, in the same way as the chi-square calculation. This expected value is the model estimate.&lt;br /&gt;&lt;br /&gt;The inputs to the model are the dimensions for the estimate, and these are necessarily categorical variables (strings) that take on, preferably, just a handful of values. As a note, the version described here has limits on the total number of values along all dimensions; these limits are imposed by SQL Server and discussed at the end of this post.&lt;br /&gt;&lt;br /&gt;This post describes how to use the model. The next post describes how it is implemented. The third post describes some additional useful technical details.&lt;br /&gt;&lt;br /&gt;This posting has three files attached:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3-load.sql"&gt;blog3-load.sql&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3.dll"&gt;blog3.dll&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3.cs"&gt;blog3.cs&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt; As explained in the first posting, the first two files are a DLL containing the functionality and a T-SQL script that loads it into the database. The third file contains the code.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;BasicMarginalValueModel Type&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;The marginal value model itself is a data type in SQL Server. 
The data type &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt; is implemented as a C# class containing all the information that describes the model as well as various functions, such as:&lt;br /&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt;, which converts the information describing the model to a string.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;Parse()&lt;/span&gt;, which parses a string containing information describing the model.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt;, which is like &lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt; except the format is binary instead of character.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt;, which is like &lt;span style="font-family:courier new;"&gt;Parse()&lt;/span&gt; except the format is binary instead of character.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Actually, there is a subtle difference between the &lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt;/&lt;span style="font-family:courier new;"&gt;Parse()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt;/&lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; pairs. &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; are used implicitly when using a type with SQL Server. They do get called. On the other hand, &lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Parse()&lt;/span&gt; are only used by SQL Server when reading and writing &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt; values to or from text. 
However, &lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt; is very handy for seeing what is happening, so it is also used manually.&lt;br /&gt;&lt;br /&gt;What does the model "look" like? It looks like pairs of values. So, if the model contained 50 states and three region types ("urban", "rural", and "mixed"), then an instance of the model would contain up to 53 key-value pairs. The key combines the dimension number (0 for state, 1 for region type) and dimension value (state or region type). The value is the count associated with that key.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;An Aggregation, Another Type, and a Function&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;Having a type is useful, but how do we create values of the type? The answer is simple: the &lt;span style="font-family:courier new;"&gt;CreateBasicMarginalValueModel&lt;/span&gt; aggregation function. This aggregation adds up the counts on all dimensions.&lt;br /&gt;&lt;br /&gt;Ideally, the aggregation function could be called as:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT ud.dbo.CreateBasicMarginalValueModel(dim1, dim2, . . ., value)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM t&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;However, this is not possible, because aggregation functions can only take one argument.  The data type &lt;span style="font-family:courier new;"&gt;MarginalValueArgs&lt;/span&gt; stores one or more dimensions along with a value.  The value would typically be 1; however, it is also possible to create the models on summarized or partially summarized data.&lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;This type has a creation function associated with it, &lt;span style="font-family:courier new;"&gt;MarginalValueArgs1()&lt;/span&gt;.  This takes the first dimension and the value.  
To add more dimensions, the type defines a function &lt;span style="font-family:courier new;"&gt;AddDim()&lt;/span&gt;.  The second, third, and so forth dimensions can be added using this function.&lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;Defining functions in user defined types is highly recommended -- except for the issue of performance.  Once you have a value of the type, the functions are accessible.  And, they do not need to be defined in SQL Server.  They come automatically with the type.  Of course, accessing the function seems to require shuffling the type data back and forth from the DLL to  SQL Server, reducing performance.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;What Creating a Model Looks Like&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;The following code shows one way to create a model using state and region type as two dimensions:&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:courier new;"&gt;SELECT ud.dbo.CreateBasicMarginalValueModel(arg).ToString()&lt;br /&gt;FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).&lt;/span&gt;&lt;/div&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.............&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;AddDim(regtype) as arg&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM (SELECT zc.*,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...................&lt;/span&gt;(CASE WHEN purban = 1 THEN 'urban'&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.........................&lt;/span&gt;WHEN purban = 0 THEN 'rural'&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.........................&lt;/span&gt;ELSE 'mixed' END) as regtype&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;FROM 
sqlbook..zipcensus zc) zc) zc&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;There are three layers of queries.  The innermost query defines the region type.  The next level defines the inputs into the model creation routine.  The outermost query aggregates these inputs into the model.  Notice that the function &lt;span style="font-family:courier new;"&gt;MarginalValueArgs1()&lt;/span&gt; defines the first dimension on state and the &lt;span style="font-family:courier new;"&gt;AddDim()&lt;/span&gt; function defines the second.&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;Although this is useful for illustration, the model only exists long enough for us to see it using the &lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt; function.  When the query stops executing, the model is no longer accessible.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;The following code assigns the model to a variable.  The model can then be referenced in multiple select statements.  Note that for this to work, the current database must be "ud", because that is where the data types are defined.  
Currently, it is not possible to define variables using data types defined in other databases.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;DECLARE @model BasicMarginalValueModel&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:courier new;"&gt;SET @model =&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;(SELECT ud.dbo.CreateBasicMarginalValueModel(arg)&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt; FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....................&lt;/span&gt; AddDim(regtype) as arg&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......... &lt;/span&gt;FROM (SELECT zc.*,&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......................&lt;/span&gt; (CASE WHEN purban = 1 THEN 'urban'&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............................. &lt;/span&gt;WHEN purban = 0 THEN 'rural'&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............................&lt;/span&gt; ELSE 'mixed' END) as regtype&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;................ 
&lt;/span&gt;FROM sqlbook..zipcensus zc) zc) zc&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family:courier new;"&gt;SELECT @model.ToString()&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;In this case, the @model variable is accessible for the statements, but it does not persist.  However, because the model is just a variable with a complicated type it could also be stored in a table.&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Scoring a Model&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;The process of scoring is simply applying the model to a given set of values.  For instance, the following query scores all the rows in the zc table:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT @model.Score(arg)&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;FROM (SELECT ud.dbo.MarginalValueArgs1(state, 1).&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.................&lt;/span&gt;AddDim(regtype) as arg&lt;br /&gt;&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM (SELECT zc.*,&lt;br /&gt;&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...................&lt;/span&gt;(CASE WHEN purban = 1 THEN 'urban'&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.........................&lt;/span&gt;WHEN purban = 0 THEN 'rural'&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.........................&lt;/span&gt;ELSE 'mixed' END) as regtype&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 
255);"&gt;............&lt;/span&gt;FROM sqlbook..zipcensus zc) zc) zc&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;That is, the dimensions are bundled together into &lt;span style="font-family:courier new;"&gt;MarginalValueArgs&lt;/span&gt; and passed to the model for scoring.&lt;br /&gt;&lt;br /&gt;The model can also be used to calculate the chi-squared value (which is probably the most useful thing to do with such a model).  This is simply another function in the &lt;span style="font-family:courier new;"&gt;BasicMarginalValueModel&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT @model.ChiSquared(arg)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT state, regtype, count(*) as cnt,&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;ud.dbo.MarginalValueArgs1(state, &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;count(*)).&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......................................&lt;/span&gt;AddDim(regtype) as arg&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM (SELECT zc.*,&lt;br /&gt;&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...................&lt;/span&gt;(CASE WHEN purban = 1 THEN 'urban'&lt;/span&gt; &lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.........................&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;WHEN purban = 0 THEN 'rural'&lt;/span&gt; &lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.........................&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;ELSE 'mixed' END) as regtype&lt;/span&gt; &lt;span style="color: rgb(255, 255, 255);font-family:courier new;" 
&gt;............&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FROM sqlbook..zipcensus zc) zc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY state, regtype&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;) zc&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);"&gt;&lt;span style="font-size:180%;"&gt;&lt;strong&gt;&lt;span style="font-family:arial;"&gt;Limits on the Model&lt;/span&gt;&lt;br /&gt;&lt;/strong&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The interface between C# and SQL Server limits the size of the model to 8,000 bytes.  This severely limits the size of the model.  In future postings, I'll suggest an alternative implementation that gets around this limit.&lt;br /&gt;&lt;br /&gt;The next posting discusses the C# implementation and the one after that extensions to the model.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;script type="text/javascript"&gt;&lt;br /&gt;&lt;br /&gt;_uacct = "UA-380835-1";&lt;br /&gt;&lt;br /&gt;urchinTracker();&lt;br /&gt;&lt;br /&gt;&lt;/script&gt;&lt;br /&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog3.cs"&gt;&lt;/a&gt;</content><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/2007/10/marginal-value-models-building-and.html' title='Marginal Value Models:  Building and Using Data Mining Models in SQL Server (Part 1)'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8614680675185707253&amp;postID=3462341556652276238' title='0 Comments'/><link rel='replies' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml' title='Post Comments'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/3462341556652276238'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/3462341556652276238'/><author><name>Gordon S. Linoff</name></author></entry><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-3024524390868009099</id><published>2007-10-20T21:36:00.000-04:00</published><updated>2007-10-22T18:55:03.711-04:00</updated><title type='text'>Marginal Value Models:  Explanation</title><content type='html'>&lt;p&gt;This posting describes a very simple type of model used when the target of the model is numeric and all the inputs are categorical variables. This posting explains the concepts behind the models. The next posting has the code associated with them.&lt;/p&gt;I call these models &lt;span style="FONT-STYLE: italic"&gt;marginal value models&lt;/span&gt;. In statistics, the term "marginal" means that we are looking at only one variable at a time. Marginal value models calculate the contribution from each variable, and then combine the results into an expected value.&lt;br /&gt;&lt;p&gt;The chi-square test operates in a similar fashion, but takes the process one step further. The chi-square test compares the actual value to the expected value to determine whether they are sufficiently close to be due to small random variations -- or far enough apart to be suspicious. Both marginal value models and the chi-square test are discussed in more detail in my most recent book &lt;span style="FONT-STYLE: italic"&gt;Data Analysis Using SQL and Excel&lt;/span&gt;. Here the emphasis is a bit different; the focus is on implementing this type of model as an extension to SQL Server.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;span style="FONT-WEIGHT: bold"&gt;&lt;span style="font-family:arial;font-size:180%;color:#006600;"&gt;What are the Marginal Values?&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;For the purposes of this discussion, the marginal values are the values summarized along one of the dimensions. 
For instance, if we are interested in the population of different parts of the United States, we might have the population for each state. The following query summarizes this information based on a table of zip code summaries (available on the companion web site to "Data Analysis Using SQL and Excel"):&lt;br /&gt;&lt;p&gt;&lt;code&gt;SELECT state, AVG(medincome), SUM(population)&lt;br /&gt;FROM zipcensus&lt;br /&gt;GROUP BY state&lt;br /&gt;&lt;/code&gt;&lt;/p&gt;&lt;p&gt;The resulting histogram shows the distribution along this dimension:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/dataminingsqlserver/uploaded_images/pic-724619.jpg"&gt;&lt;img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://www.data-miners.com/dataminingsqlserver/uploaded_images/pic-724616.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The exact values are not known. What if we also know the population for urban, rural, and mixed areas of the country? These might have the following values:&lt;br /&gt;&lt;/p&gt;&lt;pre&gt;MIXED 148,595,327&lt;br /&gt;RURAL  27,240,454&lt;br /&gt;URBAN 109,350,778&lt;/pre&gt;&lt;p&gt;Given this information about population along two dimensions, how can we combine the information to estimate, say, the rural population of Alabama?&lt;/p&gt;&lt;p&gt;Those familiar with the chi-square test recognize this as the question of the expected value. In this situation, the expected value is the total population of the state times the total population of the area category divided by the total population in the United States. That is, it is the row total times the column total divided by the total.&lt;/p&gt;For rural Alabama, this results in the following calculation: 4,446,124 * 27,240,454 / 285,186,559 = 424,685. This provides an estimate calculated by combining the information summarized along each dimension.&lt;br /&gt;&lt;p&gt;Is this estimate accurate? 
That is quite another question. If the two dimensions are statistically independent, then the estimate is quite accurate. If there is an interaction effect, then the estimate is not accurate. However, if all we have are summaries along the dimensions, then this might be the best that we can do.&lt;/p&gt;&lt;br /&gt;&lt;span style="FONT-WEIGHT: bold;font-size:180%;" &gt;&lt;span style="font-family:arial;color:#006600;"&gt;Combining Values Along More Than Two Marginal Dimensions&lt;/span&gt; &lt;/span&gt;&lt;br /&gt;&lt;p&gt;The formula for the expected value can be easily extended to multiple dimensions. The idea is to multiply ratios rather than counts. The two-dimension case can be thought of as the product of the following three numbers:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;The proportion of the population along dimension 1.&lt;/li&gt;&lt;li&gt;The proportion of the population along dimension 2.&lt;/li&gt;&lt;li&gt;The total population.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;That is, we are multiplying proportions (or probabilities, if you prefer). The idea is that the "probability" of being in Alabama is the population of Alabama divided by the population of the country. The "probability" of being rural is the rural population divided by the population of the country. The "probability" of both is the product. To get the count, we multiply the "joint probability" by the population of the country. &lt;/p&gt;&lt;p&gt;This is easily extended to multiple dimensions. The overall "probability" is the product of the "probabilities" along each dimension. To get the count, we then have to multiply by the overall population. Mathematically, the idea is to combine the distributions along each dimension, assuming statistical independence. 
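&lt;/p&gt;&lt;p&gt;As a rough sketch of this calculation in plain SQL (not the C# implementation discussed in the other postings), the following query multiplies the two proportions by the total to get an expected population for every combination of state and region type. It assumes the &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;population&lt;/code&gt;, and &lt;code&gt;purban&lt;/code&gt; columns of &lt;code&gt;zipcensus&lt;/code&gt; and borrows the region type definition used in the model-building posting; the subquery names &lt;code&gt;bystate&lt;/code&gt;, &lt;code&gt;byreg&lt;/code&gt;, and &lt;code&gt;overall&lt;/code&gt; are made up for illustration:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;-- Sketch only: expected population for each state / region type pair,
-- assuming the two dimensions are statistically independent.
WITH zc AS (
    SELECT state, population,
           (CASE WHEN purban = 1 THEN 'urban'
                 WHEN purban = 0 THEN 'rural'
                 ELSE 'mixed' END) AS regtype
    FROM zipcensus
),
bystate AS (SELECT state, SUM(population) AS pop FROM zc GROUP BY state),
byreg AS (SELECT regtype, SUM(population) AS pop FROM zc GROUP BY regtype),
overall AS (SELECT SUM(population) AS pop FROM zc)
SELECT s.state, r.regtype,
       -- proportion along dimension 1 * proportion along dimension 2 * total
       (s.pop * 1.0 / t.pop) * (r.pop * 1.0 / t.pop) * t.pop AS expected
FROM bystate s CROSS JOIN byreg r CROSS JOIN overall t
ORDER BY s.state, r.regtype
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Adding a third dimension would mean one more summary subquery and one more proportion in the product. 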
The term "probability" appears in quotes -- it is almost a philosophical question whether "probabilities" are the same as "proportions", and that is not the subject of this posting.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;This formulation of the problem is quite similar to naive Bayesian models. The only difference is that here we are working with counts and naive Bayesian models work with ratios. I will return to naive Bayesian models in later postings.&lt;/p&gt;&lt;br /&gt;&lt;span style="FONT-WEIGHT: bold;font-family:arial;font-size:180%;color:#006600;"   &gt;Combining Things That Aren't Counts&lt;/span&gt;&lt;br /&gt;&lt;p&gt;Certain things are not counts, but can be treated as counts for the purpose of calculating expected values. The key idea is that the overall totals must be the same (or at least quite close).&lt;/p&gt;For example, the census data contains the proportion of the population that has some college degree. What if we wanted to estimate this proportion for the urban population in New York?&lt;br /&gt;What we need for the marginal value model to work is simply the ability to count things up along the dimensions. In this case, we are tempted to count the proportion of the population of interest (since that is the data we have and what the question ultimately asks for). &lt;p&gt;&lt;/p&gt;However, we cannot use proportions because they do not "add up" to the same total numbers along each dimension. This means that if we take the sum of the proportions in each state the total will be quite different than the sum of the proportions for urban, rural, and mixed. If for no other reason, adding up fifty numbers (or so) is unlikely to produce the same result as adding up three.&lt;br /&gt;&lt;p&gt;Fortunately, there is a simple solution. Multiply the proportion by the appropriate population in each group, to get the number of college educated people in each group. 
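&lt;/p&gt;&lt;p&gt;A minimal sketch of this conversion along the state dimension follows. The column name &lt;code&gt;pcollege&lt;/code&gt; is hypothetical; it stands in for whatever column holds the college-educated proportion (as a fraction between 0 and 1) in each zip code:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;-- Sketch only: pcollege is a hypothetical column name.
-- Turn a proportion into a count so that it can be summed along a dimension.
SELECT state,
       SUM(population * pcollege) AS numcollege,
       SUM(population) AS population
FROM zipcensus
GROUP BY state
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;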
This number adds up appropriately along each dimension, so we can use it in the formulas described above.&lt;/p&gt;In the end, we get the number of people in, say, rural Alabama who have a college education. We can then divide by the estimate for the population, and arrive at an answer to the question.&lt;br /&gt;&lt;p&gt;This method works with other numbers of interest, such as the average income. The idea would be to multiply the average income times the population to get dollars. Dollars then add up along the dimensions, and we can calculate the appropriate values in each group.&lt;/p&gt;&lt;span style="FONT-WEIGHT: bold;font-family:arial;font-size:180%;color:#006600;"   &gt;Chi-Square Test&lt;/span&gt;&lt;br /&gt;&lt;p&gt;The final topic in this posting is the calculation of the chi-square value, using the marginal value model. The chi-square value is simply:&lt;br /&gt;&lt;/p&gt;&lt;pre&gt;&lt;code&gt;chi-square value = sum((actual - expected)^2/expected)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The value can be used as a measure of how close the observed data is to the expected values. In other words, it is a measure of how statistically independent the dimensions are. Higher values suggest interdependencies. Values closer to 0 mean that the dimensions are independent.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;This posting describes the background for marginal value models. 
The next posting describes how to add them into SQL Server.</content><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/2007/10/marginal-value-models-explanation.html' title='Marginal Value Models:  Explanation'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8614680675185707253&amp;postID=3024524390868009099' title='0 Comments'/><link rel='replies' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml' title='Post Comments'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/3024524390868009099'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/3024524390868009099'/><author><name>Gordon S. Linoff</name></author></entry><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-1331409855675409866</id><published>2007-10-14T18:48:00.000-04:00</published><updated>2007-10-14T22:56:10.348-04:00</updated><title type='text'>Two More Useful Aggregate Functions:  MinOf() and MaxOf()</title><content type='html'>&lt;script src="http://www.google-analytics.com/urchin.js" type="text/javascript"&gt;&lt;/script&gt;The overall purpose of this blog is to investigate adding data mining functionality into SQL Server (see the first post for a more detailed explanation).  We have not yet arrived at adding real data mining functionality, since this requires being comfortable with .NET, C#, and extending SQL Server.&lt;br /&gt;&lt;br /&gt;This post offers two more aggregation functions that provide a flavor for how to think about adding analytic capabilities.  These functions return the value in one column when the value of another column is at a minimum or maximum.  I call the functions &lt;span style="font-family:courier new;"&gt;MinOf()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;MaxOf()&lt;/span&gt;.  
As a brief aside, my most recent book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers" target="Amazon"&gt;Data Analysis Using SQL and Excel&lt;/a&gt; describes various other techniques for getting this information in SQL without adding new functions into the database.  Unfortunately, none of the methods is actually elegant.&lt;br /&gt;&lt;br /&gt;The attached files contain the source code as well as a DLL and SQL script for loading functionality into the database.  These files are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog2.cs"&gt;blog2.cs&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog2.dll"&gt;blog2.dll&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.data-miners.com/dataminingsqlserver/blog2-load.sql"&gt;blog2-load.sql&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;Note that these files contain all the functionality in the blog1 files as well as the new functionality here.  (The earlier post &lt;a href="http://www.data-miners.com/dataminingsqlserver/2007/09/weighted-average-example-of-enhancing.html"&gt;Weighted Average: An Example of Enhancing SQL Server Functionality&lt;/a&gt; explains how to load the functionality into the database.)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;Thinking About the Problem&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;A good place to start is to think about what the code would ideally look like.  The functions would look like:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;SELECT 〈whatever〉, MINOF(〈value〉, 〈min-column〉), MAXOF(〈value〉, 〈max-column〉)&lt;br /&gt;FROM 〈table〉&lt;br /&gt;GROUP BY 〈whatever〉&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This construct could be used, for instance, to find the first product purchased by each customer.  
Or, the most recent amount spent for each customer.&lt;br /&gt;&lt;br /&gt;Alas, we cannot extend SQL Server to support such functions, because aggregation functions can only take one argument.  This means that we have to add a new type &lt;span style="font-family:courier new;"&gt;ValuePair&lt;/span&gt; to handle the two arguments.  But even more alas, the two elements of &lt;span style="font-family:courier new;"&gt;ValuePair&lt;/span&gt; need to be of any type for the function to really be useful (for simplicity, we'll limit it to any built-in basic type).  That means that we need yet another user defined type, &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt;.  I suppose these could be compressed into a single type that took pairs of &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt;.  However, it is much cleaner to break the code into these pieces.&lt;br /&gt;&lt;br /&gt;The result is that the above code instead looks like:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;SELECT 〈whatever〉, MINOF(vp).ToDouble(), MAXOF(vp).ToDouble()&lt;br /&gt;FROM (SELECT t.*,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;ud.dbo.ValuePair(ud.dbo.AnyDouble(〈value〉),&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..............................&lt;/span&gt;ud.dbo.AnyDateTime(〈min-column〉)) as vp&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM 〈table〉 t) t&lt;br /&gt;GROUP BY 〈whatever〉&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;The variable &lt;span style="font-family:courier new;"&gt;vp&lt;/span&gt; becomes an instance of &lt;span style="font-family:courier new;"&gt;ValuePair&lt;/span&gt; for each row.  In this case, it consists of a floating point value (which is the value returned) and a date time column.  
Of course, there are "Any" functions for all the built-in types.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;span style="font-weight: bold;"&gt;What The Solution Looks Like&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;The solution consists of two user defined types, two aggregation functions, and various support functions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt; which represents any SQL type;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;ValuePair&lt;/span&gt; which contains two &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt;s;&lt;/li&gt;&lt;li&gt;MinOf() and MaxOf() aggregation functions; and,&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Various functions to create instances of &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;ValuePair&lt;/span&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The trickiest of these is the &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt; type.  The remainder are quite simple.  &lt;span style="font-family:courier new;"&gt;ValuePair&lt;/span&gt; has three members:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;isNull (the null flag);&lt;/li&gt;&lt;li&gt;value1 (of type AnyType); and&lt;/li&gt;&lt;li&gt;value2 (of type AnyType).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;It also has the appropriate member functions for a user defined type, along with methods for accessing the two values, called &lt;span style="font-family:courier new;"&gt;Value1&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Value2&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;MinOf()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;MaxOf()&lt;/span&gt; are aggregation functions.  
Each contains two private members of type &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt;, the minimum value and the minimum variable.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 102, 0);font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;Adding the &lt;span style="font-family: courier new;"&gt;AnyType&lt;/span&gt; Type&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The &lt;span style="font-family: courier new;"&gt;AnyType&lt;/span&gt; type needs to store virtually any type allowed in SQL.  Internally, it has a structure with the following members:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;private struct union_values&lt;br /&gt;{&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public Byte value_int8;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public Int16 value_int16;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public Int32 value_int32;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public Int64 value_int64;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public float value_single;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public double value_double;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public String value_string;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public Decimal value_decimal;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;....&lt;/span&gt;public DateTime value_datetime;&lt;br /&gt;};&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This would be better as the equivalent of a C union rather than a C struct, since that would use less space in memory.  
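&lt;br /&gt;&lt;br /&gt;For what it is worth, C# can approximate a C union for value types by using an explicit layout. The sketch below is only an illustration and is not how &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt; is written: the names NumericUnion and UnionSketch are hypothetical, and the sketch does not verify how such a layout behaves inside a SQL Server user defined type. Reference types such as String cannot share an offset with value-type fields, so a string member would have to stay outside the overlapped region.&lt;br /&gt;&lt;pre&gt;
```csharp
using System;
using System.Runtime.InteropServices;

// Sketch only: an explicit-layout struct in which two numeric fields overlap,
// much like a C union. The type and field names here are hypothetical.
[StructLayout(LayoutKind.Explicit)]
public struct NumericUnion
{
    [FieldOffset(0)] public long value_int64;
    [FieldOffset(0)] public double value_double;
}

public static class UnionSketch
{
    public static void Main()
    {
        NumericUnion u = new NumericUnion();
        u.value_double = 1.0;
        // Both fields share the same eight bytes, so the integer field now
        // holds the bit pattern of the double that was just stored.
        Console.WriteLine(u.value_int64 == BitConverter.DoubleToInt64Bits(1.0));
    }
}
```
&lt;/pre&gt;The saving is modest for a handful of numeric fields, which is one reason a plain struct is a reasonable choice.&lt;br /&gt;&lt;br /&gt;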
However, for the &lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;Parse()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; methods, only the one actual value is input or output.  Another member is an enumerated type defined as follows:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;private enum datatype&lt;br /&gt;{&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;dt_int8,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_int16,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_int32,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_int64,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_single,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_double,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_decimal,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_string,&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/code&gt;&lt;code&gt;dt_datetime&lt;br /&gt;};&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Each possible type has a member function for returning a particular value, such as &lt;span style="font-family:courier new;"&gt;ToTinyInt()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;ToSmallInt()&lt;/span&gt;, and so on.  These are all accessible from the SQL-side.  
Each type also has an overloaded constructor.  The constructor is not accessible from SQL.&lt;br /&gt;&lt;br /&gt;Finally, &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt; redefines the "&amp;lt;" and "&amp;gt;" operators.  This is needed for the comparisons for &lt;span style="font-family:courier new;"&gt;MinOf()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;MaxOf()&lt;/span&gt;.  These are complicated by the fact that the two arguments can be of any type.  The comparisons follow the rules of SQL, so if either value is NULL then the comparisons return false.  Only numerics can be compared to each other, so int8 can be compared to double but not to a character string or datetime.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0); font-weight: bold;font-family:arial;font-size:180%;"  &gt;Creation Functions for &lt;span style="font-family: courier new;"&gt;AnyType&lt;/span&gt; and &lt;span style="font-family: courier new;"&gt;ValuePair&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The following creation functions take a value of a particular type and return an &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt;:&lt;br /&gt;&lt;ul&gt;&lt;li style="font-family: courier new;"&gt;AnyTinyInt()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnySmallInt()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnyInt()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnyBigInt()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnyReal()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnyDouble()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnyDecimal()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnyDateTime()&lt;/li&gt;&lt;li style="font-family: courier new;"&gt;AnyString()&lt;/li&gt;&lt;/ul&gt;Adding an additional value type is quite simple.  
The following need to be modified:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The &lt;span style="font-family:courier new;"&gt;union_values&lt;/span&gt; struct in &lt;span style="font-family:courier new;"&gt;AnyType&lt;/span&gt; needs to store the new type.&lt;/li&gt;&lt;li&gt;A new constructor needs to be added for the new value.&lt;/li&gt;&lt;li&gt;A new conversion function (To&amp;lt;whatever&amp;gt;) needs to be added.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;ToString()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;Parse()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt;, and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; need to handle the new type.&lt;/li&gt;&lt;li&gt;The "&amp;gt;" and "&amp;lt;" operators need to be modified.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: courier new;"&gt;ValuePair&lt;/span&gt;, in turn, has a creation function that takes two arguments of &lt;span style="font-family: courier new;"&gt;AnyType&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The next post moves in a different direction, by talking about a particular type of data mining model, the &lt;span style="font-style: italic;"&gt;marginal value model&lt;/span&gt;.  
The first post discusses how the model works rather than how it is implemented.&lt;br /&gt;&lt;script type="text/javascript"&gt;&lt;br /&gt;_uacct = "UA-380835-1";&lt;br /&gt;urchinTracker();&lt;br /&gt;&lt;br /&gt;&lt;/script&gt;</content><link rel='alternate' type='text/html' href='http://www.data-miners.com/dataminingsqlserver/2007/10/two-more-useful-aggregate-functions.html' title='Two More Useful Aggregate Functions:  MinOf() and MaxOf()'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8614680675185707253&amp;postID=1331409855675409866' title='0 Comments'/><link rel='replies' type='application/atom+xml' href='http://www.data-miners.com/dataminingsqlserver/atom.xml' title='Post Comments'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/1331409855675409866'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8614680675185707253/posts/default/1331409855675409866'/><author><name>Gordon S. Linoff</name></author></entry><entry><id>tag:blogger.com,1999:blog-8614680675185707253.post-7128236108106906040</id><published>2007-10-09T16:49:00.000-04:00</published><updated>2007-10-09T22:31:04.070-04:00</updated><title type='text'>Weighted Average Continued:  C# Code</title><content type='html'>&lt;div&gt;The previous post described how to load the function &lt;span style="font-family:courier new;"&gt;WAVG()&lt;/span&gt; into SQL Server. This post describes the code that generates the DLL.&lt;br /&gt;&lt;br /&gt;This discussion assumes that the reader is familiar with C# or object oriented languages similar to C#, such as C++ or java. 
That said, the code itself is probably readable by most people who are familiar with object-oriented programming practices.&lt;br /&gt;&lt;br /&gt;This discussion is composed of five parts:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Very basic discussion of Microsoft Visual Studio;&lt;/li&gt;&lt;li&gt;Overview of the code and auxiliary modules;&lt;/li&gt;&lt;li&gt;Code for Adding &lt;span style="font-family:courier new;"&gt;CreateWeightedValue()&lt;/span&gt; Function;&lt;/li&gt;&lt;li&gt;Code for Adding &lt;span style="font-family:courier new;"&gt;WAvg()&lt;/span&gt; Aggregation Function; and,&lt;/li&gt;&lt;li&gt;Code for Adding &lt;span style="font-family:courier new;"&gt;WeightedValue&lt;/span&gt; Type.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The last three specifically describe code. These are ordered by difficulty, since it is easiest to add a user defined function, then an aggregation, and then a type (at least in terms of the volume of code produced). The code containing these is available &lt;a href="http://www.data-miners.com/dataminingsqlserver/blog1.cs"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Overview of Microsoft Visual Studio&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;Microsoft Visual Studio is the application used to develop C# code (as well as code in other languages) using the .NET framework.&lt;br /&gt;&lt;br /&gt;Visual Studio divides work into units called &lt;em&gt;projects&lt;/em&gt;. These consist of one or more sets of files containing programming code, and they produce something. This something could be many things:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A Windows application;&lt;/li&gt;&lt;li&gt;A ".exe" file executed from the command line;&lt;/li&gt;&lt;li&gt;A library to be shared among other projects;&lt;/li&gt;&lt;li&gt;A dynamic link library (DLL);&lt;/li&gt;&lt;li&gt;A device driver;&lt;/li&gt;&lt;li&gt;and so on. 
&lt;/li&gt;&lt;/ul&gt;The thing that we want to create is a DLL, since this can be loaded as an assembly into SQL Server.&lt;br /&gt;&lt;br /&gt;For the purposes of this example, I have created a new project called blog in the directory c:\gordon\c-sharp\UserDefinedFunctions. The screen looks like:&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center;" alt="" src="http://www.data-miners.com/dataminingsqlserver/uploaded_images/newproject-784279.jpg" border="0" /&gt; &lt;/p&gt;Creating such a project automatically opens a code file. The source code for &lt;span style="font-family:courier new;"&gt;Wavg()&lt;/span&gt; can be copied and placed into this file. After the code is in place, the DLL is created by going to the &lt;span style="font-family:courier new;"&gt;Build--&gt;Build Blog&lt;/span&gt; menu option. Any errors appear at the bottom of the screen. Visual Studio does a good job of catching errors during the compilation process.&lt;br /&gt;&lt;p&gt;Once the project has been built, the DLL is conveniently located in the path &amp;lt;project&amp;gt;\blog\blog\bin\debug\blog.dll. It can be loaded into SQL Server from this location, copied to a more convenient location, and even emailed or moved onto another computer.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Obviously, there is much more to say about Visual Studio. For that, I recommend Microsoft documentation or simply playing with the tool.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Overview of Code and Auxiliary Modules&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;Converting C# code into a DLL that can be loaded into SQL Server is a tricky process. In particular, .NET has to be sure that streams of bits represent what they are supposed to represent in both C# and SQL Server. In fact, this is a problem. For instance, by default, database values can be NULL. 
And yet, this is not part of any native C# type. To support compatibility between the systems, the code includes various using clauses and compiler directives.&lt;/p&gt;However, the bulk of the C# code for this project consists primarily of three class definitions. The class &lt;span style="font-family:courier new;"&gt;WeightedValue&lt;/span&gt; defines the type weighted value, which holds a numeric value and a numeric weight (as C# doubles). The class &lt;span style="font-family:courier new;"&gt;WAvg&lt;/span&gt; defines the aggregation function. Finally, the &lt;span style="font-family:courier new;"&gt;CreateWeightedValue()&lt;/span&gt; function is a member of another class, &lt;span style="font-family:courier new;"&gt;UserDefinedFunctions&lt;/span&gt;. Note that the names of the first two classes match the names of the type and aggregation function respectively. The name of the third class is arbitrary, but carefully chosen to convey the notion that it contains user defined functions.&lt;br /&gt;&lt;p&gt;The beginning of the C# library consists of a series of "using" clauses. These specify additional modules used by C#, and are similar to the "#include" preprocessor directive in C and C++ code. For instance, this code has the following references:&lt;/p&gt;&lt;code&gt;using System;&lt;br /&gt;using System.IO;&lt;br /&gt;using System.Data.SqlTypes;&lt;br /&gt;using Microsoft.SqlServer.Server;&lt;/code&gt;&lt;p&gt;&lt;/p&gt;The first two specify various system classes that are commonly used. The last specifies classes used specifically for interfacing to SQL Server.&lt;br /&gt;&lt;br /&gt;The third is the most interesting, because it defines the classes that contain data going between SQL Server and C#. These are SQL data types. For instance, &lt;span style="font-family:courier new;"&gt;FLOAT&lt;/span&gt; in SQL corresponds to &lt;span style="font-family:courier new;"&gt;SqlDouble&lt;/span&gt; in C#. 
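&lt;br /&gt;&lt;br /&gt;As a small illustration of how these types behave, the sketch below passes &lt;span style="font-family:courier new;"&gt;SqlDouble&lt;/span&gt; values around and checks their null flag before touching the underlying double. The class SqlDoubleSketch and the function WeightedProduct() are hypothetical names used only for this example; they are not part of the blog1 code.&lt;br /&gt;&lt;pre&gt;
```csharp
using System;
using System.Data.SqlTypes;

// Sketch only: SqlDouble carries a double together with a NULL flag.
public static class SqlDoubleSketch
{
    // Returns val * wgt, or SQL NULL when either input is NULL.
    public static SqlDouble WeightedProduct(SqlDouble val, SqlDouble wgt)
    {
        if (val.IsNull || wgt.IsNull)
        {
            return SqlDouble.Null;
        }
        // Value is only safe to read once IsNull has been ruled out.
        return new SqlDouble(val.Value * wgt.Value);
    }

    public static void Main()
    {
        SqlDouble product = WeightedProduct(new SqlDouble(2.0), new SqlDouble(3.0));
        Console.WriteLine(product.Value);

        SqlDouble missing = WeightedProduct(SqlDouble.Null, new SqlDouble(3.0));
        Console.WriteLine(missing.IsNull);
    }
}
```
&lt;/pre&gt;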
Basically, the C# classes encapsulate the basic class with a NULL flag.&lt;br /&gt;&lt;br /&gt;However, there are some subtleties when passing data back and forth. "&lt;span style="font-family:courier new;"&gt;CHAR()&lt;/span&gt;" is not supported, although "&lt;span style="font-family:courier new;"&gt;NCHAR(&amp;lt;length&amp;gt;)&lt;/span&gt;" is. Fortunately, SQL Server automatically converts between these types.&lt;br /&gt;&lt;br /&gt;More insidious is the fact that the length of strings and numerics and decimals and money all have to be specified. So, I have not figured out how to create a function that takes arbitrary numeric values. User defined functions can only take numerics of a given length. Of course, we can define our own numeric value that never overflows. More typically, though, we simply declare functions to take a &lt;span style="font-family:courier new;"&gt;FLOAT&lt;/span&gt;. This is sufficient for most purposes and gets passed to C# as &lt;span style="font-family:courier new;"&gt;SqlDouble&lt;/span&gt;. For characters, we define them to take some long character value, such as &lt;span style="font-family:courier new;"&gt;NVARCHAR(2000)&lt;/span&gt;, which is converted to SqlString.&lt;br /&gt;&lt;br /&gt;More complete matching tables are available in Microsoft documentation: &lt;a href="http://msdn2.microsoft.com/en-us/library/system.data.sqltypes%28vs.71%29.aspx"&gt;http://msdn2.microsoft.com/en-us/library/system.data.sqltypes(vs.71).aspx&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In addition to the &lt;span style="font-family:courier new;"&gt;using&lt;/span&gt; statement, there are also compiler directives. 
These are applied to classes and to members of classes, as we will see below.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Code for Adding &lt;span style="font-family:courier new;"&gt;CreateWeightedValue()&lt;/span&gt; Function&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;The following code provides the full definition for the &lt;span style="font-family:courier new;"&gt;CreateWeightedValue()&lt;/span&gt; function.&lt;br /&gt;&lt;code&gt;&lt;pre&gt;public partial class UserDefinedFunctions&lt;br /&gt;{&lt;br /&gt;    [Microsoft.SqlServer.Server.SqlFunction]&lt;br /&gt;    public static WeightedValue&lt;br /&gt;    CreateWeightedValue (SqlDouble val, SqlDouble wgt)&lt;br /&gt;    {&lt;br /&gt;        if (val.IsNull || wgt.IsNull)&lt;br /&gt;        {&lt;br /&gt;            return WeightedValue.Null;&lt;br /&gt;        }&lt;br /&gt;        return new WeightedValue(val, wgt);&lt;br /&gt;    } // CreateWeightedValue()&lt;br /&gt;} // UserDefinedFunctions&lt;br /&gt;&lt;/pre&gt;&lt;/code&gt;&lt;p&gt;&lt;/p&gt;This code defines a class called UserDefinedFunctions. The &lt;span style="font-family:courier new;"&gt;partial&lt;/span&gt; keyword simply means that the class definition may be split over several source code files. Here, it is not.&lt;br /&gt;&lt;br /&gt;The function itself is a static member of this class, so it can be called without an instance of the class being created. In fact, the function is going to be called from SQL Server. The function itself starts with a compiler directive that specifies that this is, in fact, a SQL Server function.&lt;br /&gt;&lt;br /&gt;The remainder of the function declaration specifies the arguments and the return type. 
The code in the body is quite simple.&lt;br /&gt;&lt;br /&gt;Recall from the last posting that this function is added into SQL Server using the following code:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;pre&gt;CREATE FUNCTION CreateWeightedValue (@val float, @wgt float)&lt;br /&gt;RETURNS WeightedValue as&lt;br /&gt;EXTERNAL NAME ud.UserDefinedFunctions.CreateWeightedValue&lt;br /&gt;&lt;/pre&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;This shows the correspondence between the SQL Server and C# language elements.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Code for Adding &lt;span style="font-family:courier new;"&gt;WAvg()&lt;/span&gt; Aggregation Function&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;The code for an aggregation is more complicated than the code for a single function, so the entire code is not included here. An aggregation class has the following structure:&lt;br /&gt;&lt;code&gt;&lt;pre&gt;[Serializable]&lt;br /&gt;[SqlUserDefinedAggregate(Format.UserDefined)]&lt;br /&gt;public class WAvg : IBinarySerialize&lt;br /&gt;{&lt;br /&gt;    private double sum;&lt;br /&gt;    private double cnt;&lt;br /&gt;&lt;br /&gt;    public void Init ()&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public void Accumulate (WeightedValue value)&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public void Merge (WAvg other)&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public SqlDouble Terminate ()&lt;br /&gt;    { . . . }&lt;br /&gt; &lt;br /&gt;    public void Write (BinaryWriter w)&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public void Read (BinaryReader r)&lt;br /&gt;    { . . . }&lt;br /&gt;}  // WAvg&lt;br /&gt;&lt;/pre&gt;&lt;/code&gt;&lt;p&gt;&lt;/p&gt;The way this function works is quite simple. It maintains a running sum and sum of weights. When finished, it returns the sum divided by the sum of the weights. 
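&lt;br /&gt;&lt;br /&gt;The bookkeeping can be tried out in plain C#, without any SQL Server dependencies. The sketch below is not the blog1 code: the class name WeightedAverageSketch is hypothetical, Accumulate() takes two doubles instead of a WeightedValue, and it assumes that the running sum holds value times weight. It also skips NULL handling and the case where the weights sum to zero.&lt;br /&gt;&lt;pre&gt;
```csharp
using System;

// Plain C# sketch of the weighted-average bookkeeping. Names are hypothetical;
// the four methods mirror the aggregation structure shown above.
public class WeightedAverageSketch
{
    private double sum;   // running sum of value * weight (an assumption)
    private double cnt;   // running sum of the weights

    public void Init()
    {
        sum = 0.0;
        cnt = 0.0;
    }

    public void Accumulate(double value, double weight)
    {
        sum += value * weight;
        cnt += weight;
    }

    // Combine a partial result computed elsewhere into this one.
    public void Merge(WeightedAverageSketch other)
    {
        sum += other.sum;
        cnt += other.cnt;
    }

    public double Terminate()
    {
        return sum / cnt;
    }

    public static void Main()
    {
        WeightedAverageSketch a = new WeightedAverageSketch();
        a.Init();
        a.Accumulate(10.0, 1.0);

        WeightedAverageSketch b = new WeightedAverageSketch();
        b.Init();
        b.Accumulate(20.0, 3.0);

        a.Merge(b);
        Console.WriteLine(a.Terminate());   // (10*1 + 20*3) / (1 + 3)
    }
}
```
&lt;/pre&gt;Merging the two partial results gives the same answer as accumulating both rows into a single instance, which is the property that matters when the work is split up.&lt;br /&gt;&lt;br /&gt;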
Notice that an aggregation function is really a class that contains data, along with some members of that class.&lt;br /&gt;&lt;br /&gt;More interesting is the definition itself. First, the class has two compiler directives. The first is &lt;span style="font-family:courier new;"&gt;[Serializable]&lt;/span&gt;. This directive means that the data in the class can be passed back and forth between SQL Server and C#. In particular, it means that there are two special functions that are going to be defined, &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt;. These are never called explicitly, but are part of the interface. These functions "write" the data to memory and then "read" it back . . . the data is written in one process (SQL Server or C#) and then read back in the other.&lt;br /&gt;&lt;br /&gt;The second compiler directive specifies that the class is for an aggregation function. Because the type has "complicated" data, C# does not know how to read and write it automatically. This is true of almost all data types, so the format is typically UserDefined. Other options are available as well, and are explained in Microsoft documentation.&lt;br /&gt;&lt;br /&gt;The class itself implements the IBinarySerialize interface. This is also part of the serialization stuff. Once this interface is declared, the &lt;span style="font-family:courier new;"&gt;Write()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;Read()&lt;/span&gt; functions must be defined or else there is a compiler error.&lt;br /&gt;&lt;br /&gt;The other four functions in the interface are actually useful for doing the aggregation. They are not called explicitly, but are used by SQL Server to do the work. The first is &lt;span style="font-family:courier new;"&gt;Init()&lt;/span&gt;, which initializes the values in the class to start a new aggregation. 
In this case, it sets the sum and the weight to zero.&lt;br /&gt;&lt;br /&gt;The function &lt;span style="font-family:courier new;"&gt;Accumulate()&lt;/span&gt; adds in another value. Unfortunately, the accumulation function can only take one argument, which is why we need to create a special type that contains two values. In this case, it simply increments the sum and weight values.&lt;br /&gt;&lt;br /&gt;The third function &lt;span style="font-family:courier new;"&gt;Merge()&lt;/span&gt; is probably the least obvious of the four functions. This function merges two aggregation values. Why would this ever happen? The reason is parallelism. SQL Server might separate the aggregation into multiple threads for performance reasons. This brings together the intermediate results. One super nice thing about this structure is that we get the benefits of parallel, multi-threaded performance without really having to think about it. A very nice thing indeed.&lt;br /&gt;&lt;br /&gt;The final function &lt;span style="font-family:courier new;"&gt;Terminate()&lt;/span&gt; is the most harshly named of the four. It returns the final value, in this case as a SQL floating point value (which is equivalent to a C# double).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:180%;"  &gt;&lt;strong&gt;Code for Adding &lt;span style="font-family:courier new;"&gt;WeightedValue&lt;/span&gt; Type&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;The final section is for adding in the type. 
This is similar to the aggregation class, although slightly different.&lt;br /&gt;&lt;br /&gt;The following shows the various functions in the user defined type.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;pre&gt;[Serializable]&lt;br /&gt;[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.UserDefined)]&lt;br /&gt;public class WeightedValue : INullable, IBinarySerialize&lt;br /&gt;{&lt;br /&gt;    private bool isNull;&lt;br /&gt;    private double value;&lt;br /&gt;    private double weight;&lt;br /&gt;&lt;br /&gt;    public WeightedValue ()&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public bool IsNull&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public static WeightedValue Null&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public override string ToString ()&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public static WeightedValue Parse (SqlString s)&lt;br /&gt;    { . . . }&lt;br /&gt;&lt;br /&gt;    public void Write (BinaryWriter w)&lt;br /&gt;    { . . . }&lt;br 