Thursday, December 17, 2009

What do group members have in common?

We received the following question via email.


I have a data set which has both numeric and string attributes. It is a data set of our customers doing a particular activity (eg: customers getting one particular loan). We need to find out the pattern in the data or the set of attributes which are very common for all of them.

Classification/regression not possible , because there is only one class
Association rule cannot take my numeric value into consideration
clustering clusters similar people, but not common attributes.

 What is the best method to do this? Any suggestion is greatly appreciated.

The question "what do all the customers with a particular type of loan have in common"  sounds seductively reasonable. In fact, however, the question is not useful at all because the answer is "Almost everything."  The proper question is "What, if anything, do these customers have in common with one another, but not with other people?"  Because people are all pretty much the same, it is the tiny ways they differ that arouse interest and even passion.  Think of two groups of Irishmen, one Catholic and one Protestant. Or two groups of Indians, one Hindu and one Muslim. If you started with members of only one group and started listing things they had in common, you would be unlikely to come up with anything that didn't apply equally to the other group as well.

So, what you really have is a classification task after all.  Take the folks who have the loan in question and an equal numbers of otherwise similar customers who do not. Since you say you have a mix of numeric and string attributes, I would suggest using decision trees. These can split equally well on numeric values ( x>n ) or categorical variables ( model in ('A','B','C') ). If the attributes you have are, in fact, able to distinguish the two groups, you can use the rules that describe leaves that are high in holders of product A as "what holders of product A have in common" but that is really shorthand for "what differentiates holders of product A from the rest of the world."

Labels: , ,

Wednesday, November 4, 2009

Scoring Association Rules

At M2009 (SAS's data mining conference), I was approached with the question of scoring association rules for customers. This is not a topic I have thought about very much. More typically, association rules are used qualitatively or to understand products. I hadn't thought about assigning the "best" rule (or rules) back to customers.

As a reminder, association rules provide information about items that are purchased at the same time. For example, we might find that marshmallows and chocolate bars imply graham crackers. The "marshmallows" and "chocolate bars" are the left hand side of the rule (LHS) and the graham crackers is the right hand side (RHS). The presumption is that when graham crackers are missing from a shopper's basket, then they should be there.

Most data mining software, such as SAS Enterprise Miner, SQL Server Data Mining, and SPSS Clementine, can be used to generate association rules. I prefer to calculate the rules myself using database technology, using code similar to that in Data Analysis Using SQL and Excel.

However, data mining tools do not provide the ability to score association rules for individual customers. Neither is this is a topic that I discuss in my book. My goal here is to discuss scoring rules in databases. This is because scoring is computationally expensive. Because databases can take advantage of indexing and parallel processing, they offer scope to make the score more efficient.

Hmmm, what does scoring association rules even mean? Scoring is the process of finding the best rule that a customer matches, either for a single RHS or for all possible RHSs. In the former case, the result is one rule. In the latter, it is an array of rules, for each possible RHS.

An association rule is traditionally defined by three metrics: support, confidence, and lift (as well as a fourth, the chi-square metric, which I prefer). For the purposes of this discussion, the best rule is the one with the highest confidence.

The simplistic way of doing such scoring is by considering each rule for each customer, to determine which rules apply to each customer. From the set that do apply, do some work to find the best one.

Imagine that we have a table, rules, with the following columns:
  • The number of LHS items (we assume there is 1 RHS item);
  • The RHS item.
  • The LHS items, as a string: "item1;item2;..."
There is another table, custitem, containing each customer and each item as a separate row.

The following query find all matching rules for each customer in the innermost subquery, by counting the number of items matched on the left hand side. The outer query then finds the rule (for each RHS) that has the maximum confidence, using SQL window functions.

FROM (SELECT customerid, r.rhs, r.ruleid,

. ...........(MAX(r.confidence) OVER (PARTITION BY customerid, rhs)
.............) as maxconfidence
......FROM (SELECT ci.customerid, r.rhs, r.ruleid,
.................COUNT(*) as nummatches,
............FROM custitem ci CROSS JOIN
.................rules r
............WHERE CHARINDEX(ci.item||';', r.lhs) > 0
............GROUP BY ci.customerid, r.rhs, r.ruleid
............HAVING COUNT(*) = MAX(r.numlhs)
...........) matchrules JOIN
...........rules r
...........ON matchrules.ruleid = rules.ruleid
......) cr
WHERE confidence = maxconfidence

This query is expensive, as you might guess from the use of CROSS JOIN. And, its performance gets longer particularly as the number of rules gets larger (and presumably the number of customers is larger still).

It is possible to make it more efficient, by doing tricks, such as:
  • If there are a few number of items, then the LHS could be encoded using bits. This eliminates the need for string matching.
  • The rules can be pruned, so only the rules with the highest confidence are kept.
And, although this cannot be done in SQL, the rules could be ordered by confidence (for each RHS) from highest to lowest. The first match would then stop the search.

An alternative method requires storing the rules in two tables. The first is rules, containing descriptive information about each rule, such as:
  • ruleid;
  • rhs; and,
  • numlhs.
The second is ruleitem, which contains each item in the rules. Incidentally, this is more in keeping with the spirit of normalization in relational databases.

The subquery for the scoring now changes to a join. This is useful, because it means that we can use database mechanisms -- such as indexing and table partitioning -- to speed it up.

FROM (SELECT customerid, r.rhs, r.ruleid,
.............(MAX(r.confidence) OVER (PARTITION BY customerid, rhs)
.............) as maxconfidence
......FROM (SELECT ci.customerid, r.rhs, r.ruleid,
...................COUNT(*) as nummatches, MAX(numlhs) as numlhs
............FROM custitem ci JOIN
.................ruleitems ri
.................ON ci.item = ri.item JOIN
.................rule r
.................ON ri.ruleid = ri.ruleid
............GROUP BY ci.customerid, r.rhs, r.ruleid
............HAVING COUNT(*) = MAX(r.numlhs)
...........) matchrules JOIN
...........rules r
...........ON matchrules.ruleid = rules.ruleid
......) cr
WHERE confidence = maxconfidence

Of course, such an approach makes a difference only when you need to score many customers and you have many rules. This same approach can be used for looking at a single product in the RHS or at several at one time. Of course, this would require summarizing the multiple products at the customer level in order to append the desired information on the customer record.

Labels: , ,

Friday, October 16, 2009

SVM with redundant cases

We received the following question from a reader:

I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.

I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (eg less than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.

A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.

I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?

Thanks for any insight. Will start on your archives.
There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach. You can list each distinct record only once along with its frequency. The frequency counts how many times a particular pattern of feature values (including the class assigned to the target) appears. The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded your question to someone who is--Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.

Here is his reply:

I have some fundamental questions about the appropriateness of SVM for this classification problem:

Identical observation feature vectors produce different classification outcomes. If this is truly meaningful then we are asking the SVM to construct a decision plane through a point with some of the examples in this point classified as positive and some as negative. This is not possible. This means one of two things: (a) we have a sampling problem where different observations are mapped onto the same feature vectors. (b) we have a representation problem where the feature vector is not powerful enough to distinguish observations that should be distinguished.

It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding and perhaps coming up with derived features that would make this a problem suitable for decision plane based classification algorithms such as SVMs. (is assigning the majority label to points that carry multiple observations an option?)
SVM tries to find a hyperplane that separates your classes. When, (as is very common with things such as marketing response data, or default, or fraud, or pretty much any data I ever work with), there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative you could consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occuring pattern of features will form a leaf and, taking the frquencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.

Labels: , ,

Tuesday, September 15, 2009

Adjusting for Oversampling

We recently received two similar questions about oversampling . . .

If you don´t mind, I would like to ask you a Question regarding Oversampling as you wrote in your book (Mastering Data Mining...).

I can understand how you calculate predictive lift when using oversampling, though don´t know how to do it for the confusion matrix.

Would you mind telling me how do I compute then the confusion matrix for the actual population (not the oversampled set)?

Thanks in advance for your reply and help.



I have severely unbalanced training data (180K negative cases, 430 positive cases). Yeah...very unbalanced.

I fit a model in a software program that allows instance weights (weka). I give all the positive cases a weight of 1 and all the negative cases a weight of 0.0024. I fit a model (not a decision tree so running the data through a test set is not an option to recalibrate) - like a neural network. I output the probabilities and they are out of whack - good for predicting the class or ranking but not for comparing predicted probability against actual.

What can we do to fit a model like this but then output probabilities that are in line with the distribution? Is this new (wrong) probabilities just the price we have to pay for instance weights to (1) get a model to build (2) get reasonably good classification? Can I have my cake and eat it too (classification and probs that are close to actual)?

Many many thanks!

The problem in these cases is the same. The goal is to predict a class, usually a binary class, where one outcome is rarer than the other. To generate the best model, some method of oversampling is used so the model set has equal numbers of the two outcomes. There are two common ways of doing this. Diego is probably using all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tools run faster. Brian is using a method where weights are used for the same purpose. Rare cases are given a weight of 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.

Regardless of the technique (neural network, decision trees, logistic regression, neearest neighbor, and so on), the resulting probabilities are "directionally" correct. A group of rows with a larger probability are more likey to have the modeled outcome than a group with a lower probability. This is useful for some purposes, such as getting the top 10% with the highest scores. It is not useful for other purposes, where the actual probability is needed.

Some tools can back into the desired probabilities, and do correct calculations for lift and for the confusion matrix. I think SAS Enterprise Miner, for instance, uses prior probabilties for this purpose. I say "think" because I do not actually use this feature. When I need to do this calculation, I do it manually, because not all tools support it. And, even if they do, why bother learning how. I can easily do the necessary calculations in Excel.

The key idea here is simply counting. Assume that we start with data that is 10% rare and 90% common, and we oversample so it is 50%-50%. The relationship between the original data and the model set is:
  • rare outcomes: 10% --> 50%
  • common outcomes: 90% --> 50%
To put it differently, each rare outcome in the original data is worth 5 in the model set. Each common outcome is worth 5/9 in the model set. We can call these numbers the oversampling rates for each of the outcomes.

We now apply these mappings to the results. Let's answer Brian's question for a particular situation. Say we have the above data and a result has a modeled probability of 80%. What is the actual probability?

Well, 25% means that there is 0.25 rare outcomes for 0.75 common ones. Let's undo the mapping above:
  • 0.80 / 5 = 0.16
  • 0.20 / (5/9) = 0.36
So, the expected probability on the original data is 0.16/(0.16+0.36) = 30.8%. Notice that the probability has decreased, but it is still larger than the 10% in the original data. Also notice that the lift on the model set is 80%/50% = 1.6. The lift on the original data is 3.08 (30.8% / 10%). The expected probability goes down, and the lift goes up.

This calculation can also be used for the cross-correlation matrix (or confusion matrix). In this case, you just have to divide each cell by the appropriate overampling rate. So, if the confusion matrix said:
  • 10 rows in the model set are rare and classified as rare
  • 5 rows in the model set are rare and classified as common
  • 3 rows in the model set are common and classified as rare
  • 12 rows in the model set are common and classified as common
(I apologize for not including a table, but that is more trouble than it is worth in the blog.)

In the original data, this means:
  • 2=10/5 rows in the original data are rare and classified as rare
  • 1=5/5 rows in the original data are rare and classified as common
  • 5.4 = 3/(5/9) rows inthe original data are common and classified as rare
  • 21.6 = 12/(5/9) rows in the original data are common and classified as common
These calculations are quite simple, and it is easy to set up a spreadsheet to do them.

I should also mention that this method readily works for any number of classes. Having two classes is simply the most common case.

Labels: , ,

Monday, September 7, 2009

Principal Components: Are They A Data Mining Technique?

Principal components have been mentioned in passing several times in previous posts. However, I have not ever talked specifically about them, and their relationship to data mining in general.

What are principal components? There are two common definitions that I do not find particularly insightful. I repeat them here, mostly to illustrate the distance from important mathematical ideas and their application. The first definition is that the principal components are the eigenvectors of the covariance matrix of the variables. The eigenwhats of the what? Knowing enough German to understand that "eigen" means something like "inherent" does not really help in understanding this. An explanation of this -- with lots of mathematical symbols -- is available on Wikipedia. (And, it is not surprising that the inventor of covariance Karl Pearson also invented principal component analysis.)

The second definition (which is equivalent to the first) starts by imagining the data as points in space. Off all the possible lines in the space, the first principal component is the line that maximizes the variance of the points projected on the line (and also goes through the centroid of the data points). Points, lines, projections, centroids, variance -- that also sounds a bit academic. (By the way, for the seriously mathematically inclined, here are pointers to how these defintions are the same.)

I prefer a third, less commonly touted definition, which also assumes that data is spread out as points in space. Of all possible lines in space, the first principal component is the one that minimizes the square of the distance from each data point to the line. Hey, you may be asking, "isn't this the same as the ordinary least squares regression line?" The reason why I like this approach is because it compares principal components to something that almost everyone is familiar with -- the best-fit line. And that provides an opportunity to compare and contrast and learn.

The first difference between the two is both subtle and important. The best-fit line only looks at the distance from each data point to the line along one dimension; that is, the line minimizes the sum of the squares of the differences along the target dimension ("y"). The first principal component is looking at the sum of the squares of the overall distance. The "distance" in this case is the length of the shortest vector that connects each point to the line. In general, the best fit line and the first principal component, are not the same (and I'm curious if the angle between them might be useful). A little known factoid about best fit lines is worth dropping in here. Given a set of data points (x, y), the best fit line that fits y = f(x) is different from the best fit line that fits x = f(y). And the first principal component fits "between" these lines in some sense.

There is a corollary to this. For a best-fit line, one dimension is special, the "y" dimension, because that is how the distance is measured. This is typically the target dimension for a model, the values we want to predict. For the first principal component, there is no special dimension. Hence, principal components are most useful when applied only to input variables without the target. A major difference from best-fit lines.

For me, it makes intuitive sense that the line that best fits input values would be useful for analysis. And, it makes intuitive sense in a way that the eigen-whatevers of some matrix do not intuitively say "useful" or even that the line that maximizes the variance does not say "useful". Even though all are doing the same thing, some ways of explaining the concept seem more intuitive and applicable to data analysis.

Another difference from the best fit line involves what statisticians call residuals -- that is, the difference from each of the original data points to the corresponding point on the line. For a best-fit line, the residuals are simply numbers, the difference between the original "y" and the "y" on the line. For the first principal component, the residuals are vectors -- the vectors that connect each point perpendicularly to the line. These vectors can be plotted in space. And, given a bunch of points in space, we can calculate the principal component for them. This is the second principal component. And these have residuals, and the process can keep going, for a while, yielding the third principal component, and so on.

The first principal component and the second principal component have a very particular property; they are orthogonal to each other, which means that they meet at a right angle. In fact, all principal components are orthogonal to each other, and orthogonality is a good thing when working with input values for data. So, it is tempting to replace the data with the first few principal components. It is not only tempting, but this is often a successful way to reduce the number of variables used for analysis.

By the way, there are not an infinite number of principal components. The number of principal components is the dimensionality of the original data points -- which is never more than the number of variables that define each point.

There is much more to say about principal components. The original question asked whether they are part of data mining. I have never been particularly proud of what is and what is not data mining -- I'm happy to include anything useful for data analysis under the heading. Unlike other techniques, though, principal components are not a fancy method for building predictive or descriptive models. Instead, they are part of the arsenal of tools available for managing and massaging input variables to maximize their utility.

Labels: , ,

Friday, August 28, 2009

Shazam, A Case Study in Memory Based Reasoning (MBR)

Many users are probably aware of Shazam, one of the few mobile applications that really seems to live up to the notion "Wow! It's Magic." When you are listening to music, you can run the application (presumably on your phone), click "tag it", and after a few tens of seconds of listening and processing, Shazam will tell you what you are listening to, the artist and other details.

Recently, I found an interesting article describing how it works. A presentation, with more pictures and less text, is available here. Kudos to the company and to the author, Avery Wang, for providing technical detail.

The paper does a very good job explaining the details of the algorithm. My goal here is to describe the algorithm from a higher perspective, because it is an interesting example of a memory-based reasoning algorithm. That is, an algorithm that combines information from "nearest neighbors" to arrive at a prediction.

Assume that we have a database of songs and an excerpt that we want to find in a database of millions of songs. A first approach might be to do an exhaustive search of the database to find a match. This would take a long time.

Alternatively, we can frame the problem as follows: for all songs in the database, what is the longest period of time where the excerpt overlaps part of a song. The nearest neighbors are the ones with the longest period, and, in general, we would choose the single one with the longest overlap.

Simple problem to describe. However, the real world hits quickly. The songs are probably quite clean acoustically. However, the excerpt is subject to numerous problems: background noise, loss of fidelity due to compression as the excerpt is transmitted, poor (or at least different) equipment for recording the excerpt, and so on.

Fortunately, the world of acoustics has something of a solution for this, called the "frequency domain". This is a map of all the sound frequencies, taken at a periodic interval -- say, one second. If "frequency domain" conjures up memories of things like Fourier Transforms, then you really do understand the subject.

However, for our purposes, it is enough to say that the frequency domain for a song produces a very, very bumpy curve for each second of the song -- each point is the strength of a particular acoustic frequency at that point in the song. The song can be thought of as a collection of all these curves. Taken together, these curves might resemble a map of a very hilly area. This would be a three-dimensional map of the song.

This map has peaks anologous to the tops of hills (or perhaps the tops of buildings in a city). These peaks are called a constellations, and they pretty much uniquely identify the song, regardless of all the problems mentioned above. That is, the constellations are resilitient to background noise, loss of fidelity, and so on.

Of course, we can do this for the songs in the database in advance. And, we can do this processing for a single excerpt pretty quickly.

So, the problem of finding the song with the longest overlap in seconds with the excerpt is now handled by finding consecutive seconds in a song where the frequency domain peaks match the frequency domain peaks from the excerpt. This is still a daunting problem, because there are so many peaks available. In other word, comparing one excerpt to millions of songs requires comparing hundreds of peaks in the excerpts to the many, many billions in the database -- very time consuming.

Shazam takes a very clever approach to this problem. The algorithm treats each peak as an anchor, and creates peak-pairs with other peaks "close" to the anchor. Here, "close" means that the other peaks are within a few seconds of the first and not too different in frequency. These peak-pairs are then calculated for both the song and the excerpt. The pairs are used to find sets of anchors that match between each song and the excerpt. Because the algorithm is looking for exact matches, it can use some programming tricks to make things even faster (these are described in the paper).

In the end, there is a set of anchors for each song matched by a given excerpt. For each song, these are scanned to find consecutive seconds where the anchors in each second overlap. The longest period of overlap is the distance between the excerpt and the song.

The algorithm is quite clever on several different levels. I do think that understanding it at a high level is valuable, especially since it can provide guidance to other very difficult recognition problems. On the other hand, when I use Shazam and it identifies a song, I still think it's magic.

Labels: , ,

Tuesday, August 25, 2009

Neural Networks, Predicting Continuous Values

Very good blog...
I'm doing some stuff with Clementine... and I have an issue...
My target for NN train dataset is a continuos value between 0 and 100... the problem is that is a normal/gaussian distribution and makes the NN predict bad...

How can I resolve the unbalancing data? split into classe with same frequency!?



I am not aware that neural networks have a problem with predicting values with normal distributions. In fact, if you randomize the weights in a neural network whose output layer has a linear transfer function, then the output is likely to follow a normal distribution -- just from the Central Limit Theorem of statistics.

So, you have a neural network that is not producing good results. There can be several causes.

The first thing to look for is too many inputs. Clementine has options to prune the input variables on a neural network. Be sure that you do not have too many inputs. I would recommend a variable reduction technique such as principal components, and advise you to avoid categorical variables that have many levels.

A similar problem can occur if your hidden layer is too large.

Whatever the network, it is worthwhile looking at the number of weights in the network (or a related measure called the degrees of freedom). Remember, you want to have lots of training data for each weight.

Another problem may be that the target is continuous, but bounded between 0 and 100. This could result in a neural network where the output layer uses a linear transfer function. Although not generally a bad idea, it may not work in this case because the range of a linear function is from minus infinity to positive infinity, which far exceeds the range of the data.

One simple solution would be to divide the output by 100 and treat it as a probability. The neural network should then be set up with a logistic function in the target layer.

Your idea of binning the results might also work, assuming that bins work for solving the business problem. Equal sized bins are reasonable, since they are readily understandable as quantiles.

Good luck.

Labels: , ,

Friday, June 26, 2009

When Customers Start and End

In texts on credit scoring, some effort almost always goes into defining what is to be considered as a "bad" credit. The Basel framework provides rather a precise definition of what is to be considered a default.

But I have rarely seen the same in predicting cross-sell, up-sell or churn. I do however, remember attending an SPSS conference where churn of pre-paid cards was discussed. Churn, in that case, was defined as a number of consecutive periods where the number of calls fell below a certain level.

In the past, I've used start and end dates of contracts, as well as a simple increase (or decrease) in the number of products that a customer has over time as indicators of what to target.

I'd be really interested in hearing how you define and extract targets, be it in telecom, banking, cards or any other business where you use prediction. For instance, how would you go looking for customers that have churned? Or for that matter, customers where up-sell has been successful?

This may be too simple a question, but if there are standard methods that you use, I'd be really interested in learning about them.


This is not a simple question at all. Or rather, the simplest questions are often the most illuminating.

The place where I see the biggest issues in defining starts and stops is in survival data mining (obligatory plug for my book Data Analysis Using SQL and Excel, which has two chapters on the subject). For the start date, I try to use (or approximate as closely as possible) the date when two things have occurred: the company has agreed to provide a product or service, and the customer has agreed to pay for it. In the case of post-pay telecoms, this would be the activation date -- and there are similar dates in many other industries, as varied as credit cards, cable subscriptions, and health insurance.

The activation date is often well-defined because the number of active customers gets reported through some system tied to the financial systems. Even so, there are anomalies. I recently completed a project at a large newspaper, and used their service start date as the activation date. Alas, at time, customers with start dates did not necessarily actually receive the paper on the date -- often because the newspaper delivery person could not find the address.

The stop date is even more fraught with complication, because there are a variety of different dates to choose from. For voluntary churn, there is the date the customer requests termination of the service. There is also the date when the service is actually turned off. Which to use? It depends on the application. To count active customers, we want the service cut-off date. To plan for customer retention efforts, we want to know when they call in.

Involuntary churn is also complicated, because there are a series of steps, often called the Dunning Process, which keeps track of customers who do not pay. At what point does a non-paying customer stop? When the service stops? When the bill is written off or settled? At some arbitrary point, such as 60 or 90 days of non-payment? To further confuse the situation, the business may change its rules over time. So, during some periods of time or for some customers, 60 days of non-payment results in service cutoff. For other periods or customers, 90 days might be the rule.

Often, I find multiple time-to-event problems in this scenario. How long does it take a non-paying customer to stop, if ever? How long after customers sign up do they begin?

In your particular case, the contract start date is probably a good place to start. However, the contract end date might or might not be appropriate, since this might not be updated to reflect when a customer actually stops.


Labels: , ,

Sunday, January 18, 2009

Thoughts on Understanding Neural Networks

Lately, I've been thinking quite a bit about neural networks. In particular, I've been wondering whether it is actually possible to understand them. As a note, this posting assumes that the reader has some understanding of neural networks. Of course, we at Data Miners, heartily recommend our book Data Mining Techniques for Marketing, Sales, and Customer Relationship Management for introducing neural networks (as well as a plethora of other data mining algorithms).

Let me start with a picture of a neural network. The following is a simple network that takes three inputs and has two nodes in the hidden layer:

Note that this structure of the network explains what is really happening. The "input layer" (the first layer connected to the inputs) standardizes the inputs. The "output layer" (connect to the output) is doing a regression or logistic regression, depending on whether the target is numeric or binary. The hidden layers are actually doing a mathematical operation as well. This could be the logistic function; more typically, though it is the hyperbolic tangent. All of the lines in the diagram have weights on them. Setting these weights -- plus a few others not shown -- is the process of training the neural network.

The topology of the neural network is specifically how SAS Enterprise Miner implements the network. Other tools have similar capabilities. Here, I am using SAS EM for three reasons. First, because we teach a class using this tool, I have pre-built neural network diagrams. Second, the neural network node allows me to score the hidden units. And third, the graphics provide a data-colored scatter plot, which I use to describe what's happening.

There are several ways to understand this neural network. The most basic way is "it's a black box and we don't need to understand it." In many respects, this is the standard data mining viewpoint. Neural networks often work well. However, if you want a technique that let's you undersand what it is doing, then choose another technique, such as regression or decision trees or nearest neighbor.

A related viewpoint is to write down the equation for what the network is doing. Then point out that this equation *is* the network. The problem is not that the network cannot explain what it is doing. The problem is that we human beings cannot understand what it is saying.

I am going to propose two other ways of looking at the network. One is geometrically. The inputs are projected onto the outputs of the hidden layer. The results of this projection are then combined to form the output. The other method is, for lack of a better term, "clustering". The hidden nodes actually identify patterns in the original data, and one hidden node usually dominates the output within a cluster.

Let me start with the geometric interpretation. For the network above, there are three dimensions of inputs and two hidden nodes. So, three dimensions are projected down to two dimensions.

I do need to emphasize that these projections are not the linear projections. This means that they are not described by simple matrices. These are non-linear projections. In particular, a given dimension could be stretched non-uniformly, which further complicates the situation.

I chose two nodes in the hidden layer on purpose, simply because two dimensions are pretty easy to visualize. Then I went and I tried it on a small neural network, using Enterprise Miner. The next couple of pictures are scatter plots made with EM. It has the nice feature that I can color the points based on data -- a feature sadly lacking from Excel.

The following scatter plot shows the original data points (about 2,700 of them). The positions are determined by the outputs of the hidden layers. The colors show the output of the network itself (blue being close to 0 and red being close to 1). The network is predicting a value of 0 or 1 based on a balanced training set and three inputs.

Hmm, the overall output is pretty much related to the H1 output rather than the H2 output. We see this becasuse the color changes primarily as we move horizontally across the scatter plot and not vertically. This is interesting. It means that H2 is contributing little to the network prediction. Under these particular circumstances, we can explain the output of the neural network by explaining what is happening at H1. And what is happening at H1 is a lot like a logistic regression, where we can determine the weights of different variables going in.

Note that this is an approximation, because H2 does make some contribution. But it is a close approximation, because for almost all input data points, H1 is the dominant node.

This pattern is a consequence of the distribution of the input data. Note that H2 is always negative and close to -1, whereas H1 varies from -1 to 1 (as we would expect, given the transfer function). This is because the inputs are always positive and in a particular range. The inputs do not result in the full range of values for each hidden node. This fact, in turn, provides a clue to what the neural network is doing. Also, this is close to a degenerate case because one hidden unit is almost always ignored. It does illustrate that looking at the outputs of the hidden layers are useful.

This suggests another approach. Imagine the space of H1 and H2 values, and further that any combination of them might exist (do remember that because of the transfer function, the values actually are limited to the range -1 to 1). Within this space, which node dominates the calculation of the output of the network?

To answer this question, I had to come up with some reasonable way to compare the following values:
  • Network output: exp(bias + a1*H1 + a2*H2)
  • H1 only: exp(bias + a1*H1)
  • H2 only: exp(bias + a2*H2)
Let me give an example with numbers. For the network above, we have the following when H1 and H2 are both -1:
  • Network output: 0.9994
  • H1 only output: 0.9926
  • H2 only output: 0.9749
To calculate the contribution of H1, I use the ratio of the sums of the squares of the differences, as in the following example for H1:
  • H1 contribution: (0.9994 - 0.9926)^2 / ((0.9994 - 0.9926)^2 + (0.9994 - 0.9749)^2)
The following scatter plot shows the regions where H1 dominates the overall prediction of the network using this metric (red is H1 is dominant; blue is H2 is dominant):

There are four regions in this scatter plot, defined essentially by the intersection of two lines. In fact, each hidden node is going to add another line on this chart, generating more regions. Within each region, one node is going to dominate. The boundaries are fuzzy. Sometimes this makes no difference, because the output on either side is the same; sometimes it does make a difference.

Note that this scatter plot assumes that the inputs can generate all combinations of values from the hidden units. However, in practice, this is not true, as shown on the previous scatter plot, which essentially covers only the lowest eights of this one.

With the contribution metric, we can then say that for different regions in the hidden unit space, different hidden units dominate the output. This is essentially saying that in different areas, we only need one hidden unit to determine the outcome of the network. Within each region, then, we can identify the variables used by the hidden units and say that they are determining the outcome of the network.

This idea leads to a way to start to understand standard multilayer perceptron neural networks, at least in the space of the hidden units. We can identify the regions where particular hidden units dominate the output of the network. Within each region, we can identify which variables dominate the output of that hidden unit. Perhaps this explains what is happening in the network, because the input ranges limit the outputs only to one region.

More likely, we have to return to the original inputs to determine which hidden unit dominates for a given combination of inputs. I've only just started thinking about this idea, so perhaps I'll follow up in a later post.


Labels: , , ,

Tuesday, August 5, 2008

B2B Data Mining

Last week, I travelled to Malaysia to apply data mining techniques to business-to-business (B2B) problems. More typically, the work we do at data miners is in the world of business-to-consumer (B2C) , since a company with millions of customers is more likely to need our services. However, a certain gentleman read my book Data Analysis Using SQL and Excel and wanted to apply the ideas to his business.

In this case, the business is a large computer hardware business, that sells complex customized systems to customers. In fact, the systems are so complex that they have a special organization to configure the systems. The joke is that this allows the sales force to spend more time on the golf course. More seriously, it reduces configuration mistakes and helps ensure that the end customers receive the right system.

The data that we worked with was primarily sales data. The data is not tiny; several tables contain millions of rows, and the whole database probably occupies about ten gigabytes of data. However, the data structure is more complex than in a typical B2C database.

First, the important item of data is a quote. This is a configuration of hardware, software, networking equipment, and services provided to the customer. Some quotes result in orders; many do not. Each customer typically has many quotes, and several employees work on each quote. However, typically only one of those employees (let's call that person the sales rep) is credited with the sale.

A quote contains lists of items. However, these items are at the most detailed level (let's call it the UPC -- universal product code -- level). The hierarchy of products is important, particularly the two highest levels.

The highest level is the product category -- hardware, software, and so on. The next highest level is the catalog-listing level. That is, the product name in the catalog. For instance, I am writing this on an old Dell Precision M60 -- that is the "catalog" name. However, the specific disk size, number of megabytes, peripherals, and software might differ from computer to computer. Does it have a 40 Gig disk or an 80 Gig disk? And so on.

What type of questions might be answered by a history of customer quotes, with associated dimension information?

One question is the "quote-to-close" question. This is a simple time-to-event problem of how long from when a quote is created to when the quote results in an order. Generally, companies in this business look at the close rate for quotes, but not the timing of the close. However, the timing provides important information, particularly when it is changing over time.

A related question is what affects the quote-to-close time? Complex quotes with many different components might lengthen the time. Similarly, large discounts that require more levels of approval might lengthen the time. In this case, we were surprised that certain small quotes had a very long time. These small quotes had large percentage discounts, and were often paired with larger orders. The large percentage discount required more levels of approval -- a sign that the company is spending too much effort in the wrong place.

Such data also suggests market basket analysis. "What products are purchased together?" is not quite the right question. Instead, the question is "What configurations are most common?"

Common configuations lead to additional questions. Which configurations (if any) are associated with longer quote-to-close times? Which configuraitons are associated with longer install times? Some products are readily available, some have a longer lead time, and some are built-to-order. The logistics database is not necessarily available for quotes. And, the historical logistics database (what is the lead time for products at a given point in the past) probably does not even exist.

Another genre of question is related to purchase velocity. After a customer makes a purchase, when does the next purchase take place? In this business, purchases can be one of several types. A new purchase represents new equipment; an upgrade is adds additional components to existing equipment; and a replacement replaces one model for another. The pattern of purchases is interesting, both at the customer level and the sales rep level. Some sales reps may be focusing on upgrade business (which is relatively easy, since the customer already has the product), and not expanding into new customers.

A related idea to purchase velocity is to categorize customers by the time since their most recent purchase. What volume of purchases are made by customers who purchased something in the previous month? In the previous twelve months? By year? Most customers make at least one purchase per year. For the dormant customers who return, what caused them to come back.

B2B data is more complicated than B2C, because purchases are more complex and customers are organizations with multiple sites and contacts. Often, people are paid to understand the customers and to meet their needs. However, sales reps understand individual customers. There is ample opportunity for data mining to learn patterns about customers in general and to learn what works and does not work ijn particular circumstances.

Labels: ,