Friday, September 5, 2008

Sorting Cells in Excel Using Formulas, Part 2

In a previous post, I described how to create a new table in Excel from an existing table where the cells in the new table are sorted by some column in the existing table. In addition, the new table is automatically updated when the values in the original table are modified.

The previously described approach, alas, has some shortcomings:
  • Only one column can be used for the sort key.
  • The column must be numeric.
  • The column cannot have any duplicate values.
This post generalizes on the earlier method by fixing these problems.

If you are interested in this post, you may be interested in my book Data Analysis Using SQL and Excel.

Overview of Simpler Method

The simpler method described in the earlier post recognizes that creating a live sorted table connect to another table consists of the following steps:
  1. Ranking the rows in the table by the column to be sorted.
  2. Using the rank with the OFFSET() function to create the resulting table.
For Step (1), the method uses the built-in RANK() function provided by Excel. This introduces the limitations described above, because RANK() only works on numeric values and produces the same value for duplicates.

The key to fixing these problems is to replace the RANK() function with more general purpose functions.

Instead of RANK()

RANK() determines whether a value is the largest, second largest, third largest, or so on with respect to a list (or smallest, if we are going in the opposite order, which is determined by an optional third argument). One way to think of what it does is that it sorts the values in the list and determines the position of the original value.

An alternative but equivalent way of thinking about the calculation is that it tells us how many values are larger than (or smaller than) the given value. This alternative definition suggests other ways of arriving at the same rankings, such as:

....=COUNTIF(data!B$2:B$55, ">="&data!B2)

This formula can be placed alongside the original table (or anywhere else) and then copied down. It works by counting the number of values that are less than or equal to each value. The resulting ranking is from smallest value to largest value. To reverse the order, simply use "<=" instead. This solves one of the original problems, because the COUNTIF() function works with string data as well as numeric data.

An almost equivalent formulation is to use array functions.

....{=SUM(if(data!B$2:B$55>=data!B2, 1, 0)}

(If you are not familiar with array functions, check out Excel documentation or Data Analysis Using SQL and Excel.)

This is very similar to the COUNTIF() method, although the array functions have one advantage. The conditional logic can be more complicated, so we can do the ranking by multiple columns at the same time.

Using our own version of the rank function fixes two of the three problems. At this point, duplicates still get the same rank value.

Handling Duplicates

The problem with duplicate values is that all these methods assign the same ranking when two different rows have the same value. This makes it impossible to distinguish between the two rows, so one will be included in the sorted table multiple times.

The solution is to fix this problem by adding an offset. If the highest value is repeated multiple times, then all of those rows will have a ranking equal to the number of duplicates. In the following little table, the second column contains the rankings as calculated by either of the above two methods (RANK() does not work because the first column is not numeric):

a 3

a 3

a 3

b 5

b 5

What we want, though, is to have distinct values in the second column:

a 1

a 2

a 3

b 4

b 5

The solution is to subtract a value from the calculated ranking. This is the number of values that we have already seen that are equal to the value in question. Once again, this can be accomplished with either COUNTIF() or array functions:

....=COUNTIF(data!B$2:B$55, ">="&data!B2) + COUNTIF(data!B$2:B2, "="&data!B2)-1


....{=SUM(IF(data!B$2:B$55>=data!B2, 1, 0)) + SUM(IF(data!B$2:B2=data!B2, 1, 0))-1}

These formulations consist of two parts. The first part calculates a ranking, where duplicates get the same value. The second part subtracts the number of duplicates already seen in the list. For the simple example above, the results are actually:

a 3

a 2

a 1

b 5

b 4

This works just as well, although it does not preserve the original ordering.

Note that these formulas are all structured so they can be copied down cells and continue working.

What It All Looks Like Together

This method is perhaps best explained by seeing an example. The file sort-in-place.xls contains random information about the fifty states (latitude, longitude, population, and capital for example) on the "data" worksheet. The "data-sorted" worksheet shows the states abbreviations by rank order for each of the columns. For instance, for the size column Alaska is first, followedy by Texas, California, and Montana. For the population columns, the ordering is California, Texas, New York, and Florida. This worksheet using the rankings on the "ranking-countif()" worksheet.

The three worksheets called "ranking-" illustrate the three different methods of doing the rankings -- using RANK(), using COUNTIF(), and using array functions. Note that the RANK() method cannot handle text columns, so it does not work in this case.

If you like, you can change the data on the "data" tab and see the rankings change on the sorted tab. Voila! A sorted table connected by formulas to the original table!

Labels: ,


Anonymous Sandra said...

Thanks for the post - this was very helpful solution. I had tried several methods with arrays, but kept running into issues with duplicates. This one is so simple, but works great!

December 23, 2008 9:39 AM  
Anonymous Ian Leong said...

I second the previous comment. Ranking data from another tab with duplicate values is something I have been struggling with for a while. I am trying to create a dashboard with Top 10 values for several KPIs. Fortunately for me, my rows are also states so your Excel sheet with sample data was a perfect guide. Thanks again and I will be checking back regularly!


March 29, 2009 6:25 PM  
Anonymous Ian Leong said...

OK here is my next challenge. My dashboard returns Top 10 values for several KPIs. Each set of KPIs on the dashboard contains two columns: one for the state and one for the corresponding value. How do I return the correspoding state for each Top 10 value? Currently trying to use OFFSET function, but I'm either using bad syntax or using the wrong approach. Any help would be appreciated.

BTW, your blog is a data miner's dream. Thanks!

March 29, 2009 6:56 PM  

Post a Comment

Links to this post:

Create a Link

<< Home