### Agglomerative Variable Clustering

Lately, I've been thinking about reducing the number of variables, and how this is a lot like clustering variables (rather than clustering rows). This post is about a method that seems intuitive to me, although I haven't found any references to it. Perhaps a reader will point me to references and a formal name. The method uses Pearson correlation and principal components to agglomeratively cluster the variables.

Agglomerative clustering is the process of assigning records to clusters, starting with the records that are closest to each other. This process is repeated until all records are placed into a single cluster. The advantage of agglomerative clustering is that it creates a structure for the records, and the user can see different numbers of clusters. Divisive clustering, such as that implemented by SAS's PROC VARCLUS, produces something similar, but from the top down.

Agglomerative variable clustering works the same way. Two variables are put into the same cluster, based on their proximity. The cluster then needs to be defined in some manner, by combining information in the cluster.

The natural measure for proximity is the square of the (Pearson) correlation between the variables. This is a value between 0 and 1, where 0 means the variables are totally uncorrelated and 1 means the values are collinear. For those who are more graphically inclined, this statistic has an easy interpretation when there are two variables: it is the R-square value of the best-fit line through the scatter plot.
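As a rough illustration of this proximity measure, here is a minimal Python sketch (not the SAS code behind this post). The function names `pearson` and `proximity` and the sample columns are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def proximity(x, y):
    """Squared correlation: 0 means uncorrelated, 1 means collinear."""
    return pearson(x, y) ** 2

# Hypothetical sample columns, chosen so the results are easy to check by hand.
base = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]      # exact increasing linear function of base
flipped = [8.0, 6.0, 4.0, 2.0]      # exact decreasing linear function of base
unrelated = [1.0, -1.0, -1.0, 1.0]  # zero covariance with base

print(proximity(base, doubled))
print(proximity(base, flipped))
print(proximity(base, unrelated))
```

Squaring the correlation means that a strong negative relationship counts as just as close as a strong positive one, which is what we want when the goal is to find redundant variables.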

Combining two variables into a cluster requires creating a single variable to represent the cluster. The natural variable for this is the first principal component.
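For two variables, the first principal component can be written down directly. The following is a sketch that assumes both variables have first been standardized to mean 0 and variance 1 (written $z_1$ and $z_2$, symbols introduced here for illustration) and have correlation $r$. Their correlation matrix and its eigenvalues are:

$$
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad \lambda_1 = 1 + |r|, \qquad \lambda_2 = 1 - |r|
$$

The first principal component and the share of the total variance it captures are then:

$$
\mathrm{PC}_1 = \frac{z_1 + \operatorname{sgn}(r)\, z_2}{\sqrt{2}}, \qquad \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + |r|}{2}
$$

When $r = 0$ the two eigenvalues are equal and any direction works, so the sign can be taken as $+1$. The more strongly the two variables are correlated, the closer this share gets to 1, and the less information is lost by replacing the pair with a single variable.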

My proposed clustering method repeatedly does the following:

- Finds the two variables with the highest squared correlation.
- Calculates the first principal component for these variables and adds it to the data.
- Maintains the information that the two variables have been combined.
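The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS code behind this post: every column is standardized first, the first principal component of a pair is computed from the closed form for two standardized variables, and the names `cluster_variables`, `cluster1`, `cluster2` and the sample data are hypothetical. Constant columns (zero variance) are not handled.

```python
import math
from itertools import combinations

def standardize(values):
    """Rescale a column to mean 0 and sample variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def correlation(z1, z2):
    """Pearson correlation of two already-standardized columns."""
    return sum(a * b for a, b in zip(z1, z2)) / (len(z1) - 1)

def first_component(z1, z2, r):
    """First principal component of two standardized columns with correlation r."""
    sign = 1.0 if r >= 0 else -1.0
    return [(a + sign * b) / math.sqrt(2) for a, b in zip(z1, z2)]

def cluster_variables(data):
    """data: dict of column name -> list of numbers.
    Returns the merge history as (new_name, left, right, r_squared) tuples."""
    active = {name: standardize(col) for name, col in data.items()}
    merges = []
    while len(active) > 1:
        # Step 1: find the pair with the highest squared correlation.
        left, right = max(
            combinations(active, 2),
            key=lambda pair: correlation(active[pair[0]], active[pair[1]]) ** 2,
        )
        r = correlation(active[left], active[right])
        # Step 2: replace the pair with its first principal component.
        new_name = f"cluster{len(merges) + 1}"
        component = first_component(active.pop(left), active.pop(right), r)
        active[new_name] = standardize(component)
        # Step 3: remember which variables were combined.
        merges.append((new_name, left, right, r * r))
    return merges

# Hypothetical data: b is roughly twice a, while c is only weakly related to either.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
for merge in cluster_variables(data):
    print(merge)
```

The merge history plays the role of the bookkeeping in the third step: reading it from the top gives the clusters at any desired level of aggregation.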

A query of this form against the columns data set selects the variables:

    proc sql;
        select colname
        from columns
        where counter <= [some number] <>

These variables can then be used for predictive models or visualization purposes.

The inner loop of the code works by doing the following:

- Calling proc corr to calculate the correlation of all variables not already in a cluster.
- Using proc transpose to reshape the correlations into a table with three columns: two for the variable names and one for the correlation.
- Finding the pair of variables with the largest squared correlation.
- Calculating the first principal component for these variables.
- Appending this principal component to the data set.
- Updating the columns data set with information about the new cluster.
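To make the transpose-and-search steps in this list concrete, here is a small Python sketch (again, not the SAS code itself). It flattens a square correlation table into three-column rows and picks the strongest pair. The function names and the correlation values are invented for the example.

```python
def correlation_pairs(corr):
    """Flatten a square correlation table into (var1, var2, correlation) rows,
    keeping each unordered pair once and skipping the diagonal."""
    names = sorted(corr)
    return [
        (a, b, corr[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]

def strongest_pair(pairs):
    """Row with the largest squared correlation."""
    return max(pairs, key=lambda row: row[2] ** 2)

# Hypothetical correlation table for three variables.
corr = {
    "age":    {"age": 1.0,  "income": 0.4, "tenure": -0.7},
    "income": {"age": 0.4,  "income": 1.0, "tenure": 0.1},
    "tenure": {"age": -0.7, "income": 0.1, "tenure": 1.0},
}

pairs = correlation_pairs(corr)
best = strongest_pair(pairs)
print(pairs)
print(best)
```

Here the negatively correlated pair wins because the comparison is on the squared value.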

Labels: gordon, SAS Code, statistics

## 3 Comments:

You might be interested in reading:

Catherine Krier, Fabrice Rossi, Damien François and Michel Verleysen, "A data-driven functional projection approach for the selection of feature ranges in spectra with ICA or cluster analysis," Chemometrics and Intelligent Laboratory Systems, Elsevier, Vol. 91, No. 1 (15 March 2008), pp. 43-53.

http://www.dice.ucl.ac.be/~verleyse/papers/cils08ck.pdf

and the references therein.

SAS PROC VARCLUS

PROC VARCLUS is specifically my inspiration for thinking about an agglomerative approach to clustering variables. PROC VARCLUS implements various divisive methods, where all the variables are included in a single cluster, and this gets broken into smaller clusters. I realize that when I first wrote this post, I used the term "hierarchical" when I should have used the term "agglomerative"; I have since fixed that.
