Wednesday, November 12, 2008

Creating Accurate Venn Diagrams in Excel, Part 2

This post is an extention of an earlier post. If you are interested in this, you may be interested in my book Data Analysis Using SQL and Excel.

This post is about creating a Venn diagram using two circles. A Venn diagram is used to explain data such as:
  • Group A has 81 members.
  • Group B has 25 members.
  • There are 15 members in both groups A and B.
The above data is shown as a Venn diagram as:

Unfortunately, creating a simple Venn diagram is not built into Excel, so we need to create one manually. This is another example that shows off the power of Excel charting to do unexpected things.

Specifically, creating the above diagram requires the following capabilities:
  1. We need to draw a circle with a given radius and a center at any point.
  2. We need to fill in the circle with appropriate shading.
  3. We need to calculate the appropriate centers and radii given data.
  4. We need to annotate the chart with text.
Each of these are explained below. All of the charts and formulas are available in the accompanying Excel file.
Drawing a Circle Using Scatter Plots

To create the circle, we start with a bunch of points, that when connected with smoothed lines will look like a circle. To get the points, we'll create a table with values from 0 to 360 degrees, and borrow some formulas from trigonometry. These say:

  • X = radius*sin() + X-offset
  • Y = radius*cos() + Y-offset
The only slight complication is that the functions SIN() and COS() take their arguments in something called radian rather than degrees. This makes the formula look like:
  • X = radius*sin(*2*PI/360) + X-offset
  • Y = radius*cos(*2*PI/360) + Y-offset
The following shows the formulas:

Degrees X-Value Y-Value

0 =$E$4+SIN(2*PI()*B11/360)*$D$4 =$F$4+COS(2*PI()*B11/360)*$D$4

5 =$E$4+SIN(2*PI()*B12/360)*$D$4 =$F$4+COS(2*PI()*B12/360)*$D$4

10 =$E$4+SIN(2*PI()*B13/360)*$D$4 =$F$4+COS(2*PI()*B13/360)*$D$4

15 =$E$4+SIN(2*PI()*B14/360)*$D$4 =$F$4+COS(2*PI()*B14/360)*$D$4

20 =$E$4+SIN(2*PI()*B15/360)*$D$4 =$F$4+COS(2*PI()*B15/360)*$D$4

25 =$E$4+SIN(2*PI()*B16/360)*$D$4 =$F$4+COS(2*PI()*B16/360)*$D$4

30 =$E$4+SIN(2*PI()*B17/360)*$D$4 =$F$4+COS(2*PI()*B17/360)*$D$4

Where E4 contains the X-offset; F4 contains the Y-offset; and D4 contains the radius.

The degree values need to extend all the way to 360 to get a full circle, which can then be plotted as a scatter plot. When choosing which variety of the scatter plot, choose the option of points connected with smoothed lines.

The following chart shows the resulting circle with the points highlighted, along with axis labels and grid lines (which should be removed before creating the final version):

Creating a second circle is as easy as creating one, by just adding a second set of series onto the chart.

Filling in the Circle with Appropriate Shading

Unfortunately, to Excel, the circle is really just a collection of points, and we cannot fill it with shading. However, with a clever idea of using error bars, we can put in a pattern, such as:

The idea is to create X error bars for horizontal lines and Y error bars for vertical lines. To do this. right click on the circle and choose "Format Data Series". Then go to the "X Error Bars" or "Y Error Bars" tab (whichever is appropriate). Put 101 in the "Percent" box.

This adds the error bars. To format then, double click on one of them. You can set the color for them and also remove the little line at the edge.

You will notice that these bars are not evenly spaced. The spacing is related to the degrees. With the proper choice of degrees, the points would be evenly spaced. However, I do not mind the uneven spacing, and have not bothered to figure out a better set of points for even spacing.

Calculating Where the Circles Should Be

Given the area of a circle, calculating the radius is a simple matter of reversing the area formula. So, we have:
  • radius = SQRT(area/PI())
So, getting the radii for the two circles is easy. The questions is: where should the second circle be place to get the right overlap?

Unfortunately, there is no easy solution. First, we have to apply some complicated arithmetic to calculate the overlap between two circles, given a width of the overlap. Then we have to find the overlap that gives the correct area.

The first part is solved by finding the area of overlap between two circules, at a site such as Wolfram Math World.

The second is solved by using the "Goal Seek" functionality under the tools bar. We simple set up a worksheet that calculates the area of the overlap, given the width of the overlap and the two radii. One of the cells has the difference between this value and the area that we want. We then use Goal Seek to set this value to 0.

Annotating the Chart with Text
The final step is annotating the chart with text, such as "A Only: 65". First, we put this string in a cell, using a formula such as:
  • ="A Only: "&C4-C6
Then, we inlcude this text in the chart by selecting the chart, and typing "=" and followed by the cell address (or using the mouse).
Publish Post

In the end, we are able to create an accurate Venn diagram with two circles, of any size and overlap.


Labels: ,

Friday, October 24, 2008

Creating Accurate Venn Diagrams in Excel, Part 1

This post (and the next) are about creating accurate Venn diagrams using Excel charts. If you are interested in this, you may be interested in my book Data Analysis Using SQL and Excel.

Recently, I had occasion to analyze prescriber data for a project at a pharmaceutical company. On of the things we wanted to do was to compare visually the prescribing habits of psychiatrists, by places them into three groups: those who only prescribe drug A; those who only prescriber drug B; and those who prescribe both. The resulting chart is:

This chart is an example of a Venn diagram. Unfortunately, Excel does not have a built-in Venn diagram creator. And, if you do a google search, you will get many examples, where the circles are placed manually. Perhaps it is my background in data analysis, but I often prefer accuracy to laziness. So, I developed a method to create simple but accurate Venn diagrams in Excel.

Creating such diagrams is, fundamentally, rather simple. However, there is some math involved. To simplify the math, this post first describes how to create a Venn diagram where the two shapes are squares. In the next post, I'll extend the ideas to using circles.

Creating a Venn diagram requires understanding the following:
  1. Creating shapes in Excel.
  2. Calculating the correct overlap of the shapes.
  3. Putting it all together.
This post discusses each of these.

Creating a Shape in Excel

How does one create a shape using Excel charts. The simple answer here is using the scatter plot. If we want to make a square, we can simply plot the four corners of the square and connect them using lines, as in the following example:

Here the square has an area of 81, so each side is exactly nine units long. It is created using five data points:

X-Value Y-Value

-4.50 -4.50

-4.50 4.50

4.50 4.50

4.50 -4.50

-4.50 -4.50

Notice that the first point is repeated twice. Otherwise, there would be four points, but only three sides.

A small challenge in doing this is making the chart look like a square instead of a rectangle. Unfortunately, Excel does not make it easy to adjust the size of a chart, say by right clicking and just entering the width and height.

One way to make the chart square is to place it in a single cell and then adjust the row height and column height to be equal. My prefered method is just to eye-ball it. The above chart has a width of six columns and a height of 21 rows.

In this case, the square is centered on the origin. There is a reason for this. The temptation is to have the square be positioned at the origin and then pass through the points (0,9), (9,9), and (9,0). However, I find that when Excel draws the square, the axes interfere with the sides of the square, so some are shaded heavier than others. This happens even when I remove the axes.

As an aside here, you can imagine creating many different types of shapes in Excel besides squares. However, Excel only understands these as lines connecting a scatter plot. In particular, this means that you cannot color the interior of the shape.

Calculating the Overlaps

Assume that we have two squares that overlap, one square has an area of 100 (side is 10) and the other 25 (side is 5). What is the overlap between them?

There is not enough information to answer this question. It is clearly between 0 (if the squares do not overlap) and 25 (the size of the smaller square). If the overlap is 10, how big is the overlap? In the following picture, the area of C is 10.

What are the dimensions of C? The height is the height of the smaller square -- 5. So the width must be 2 (=10/5).

Putting It Together

To put this together for a Venn diagram using squares, we simply need to position two squares given the following information:
  • The sizes of the two squares.
  • The overlap between them.
Consider the original diagram at the top of this posting. The sizes of the two regions are 13,941 and 11,175 respectively. The overlap is 9,783.

The first thing to calculate is the side length for the two squares:
  • 118.07 for the first square (=sqrt(13,941)).
  • 105.71 for the second (=sqrt(11,175)).
Then, we need to calculate the width of the overlapping region (we already know its height and area):
  • 92.54 = 9,783 / 105.71
Now we need to calculate the points for the two squares. The way that I do the calculation is to place square at the origin, and then to add X- and Y- offsets to shift it around the plane. So, the general formula for the points are:
  • (0 + X-offset, 0 + Y-offset)
  • (side + X-offset, 0 + Y-offset)
  • (side + X-offset, side + Y-offset)
  • (0 + X-offset, side + Y-offset)
  • (0 + X-offset, 0 + Y-offset)
Since we know the side lengths of the two squares, I only need to calculate the offset values. The first square is centered at the origin (rather than starting there), so the offset is - side/2 for both X and Y.

The second square is centered vertically, so its Y-offset is also - side/2. The X-offset is the bigger challenge. In order to get the correct overlap, it is:
  • (side-first - X-offset-first) - overlap-width
The attached spreadsheet has these calculations. The data table on the spreadsheet looks like:

Area Side X Offset Y Offset

Left 13,941 118.07 -59.04 -59.04

Right 11,175 105.71 -33.51 52.86

Overlap 9,783 92.54

big square little square

-59.04 -59.04 72.20 52.86

-59.04 59.04 -33.51 52.86

59.04 59.04 -33.51 -52.86

59.04 -59.04 72.20 -52.86

-59.04 -59.04 72.20 52.86

The points are listed under "big square" and "little square". The first column is the X value for the big square, the second is the Y value; the third is the X value for the little square and the fourth is the Y value.

After creating the chart, you need to beautify it. I remove the axes and axis' labels, thicken the lines around the squares, and adjust the height and width to make the shape look like a square.

The attached .xls file (venn-20081025.xls) contains all the examples in this post.

The next post extends these ideas to creating Venn diagrams with circles, which are the more typical shape for them. It also shows one way to put some color in the shapes to highlight the different regions.

Labels: ,

Friday, September 5, 2008

Sorting Cells in Excel Using Formulas, Part 2

In a previous post, I described how to create a new table in Excel from an existing table where the cells in the new table are sorted by some column in the existing table. In addition, the new table is automatically updated when the values in the original table are modified.

The previously described approach, alas, has some shortcomings:
  • Only one column can be used for the sort key.
  • The column must be numeric.
  • The column cannot have any duplicate values.
This post generalizes on the earlier method by fixing these problems.

If you are interested in this post, you may be interested in my book Data Analysis Using SQL and Excel.

Overview of Simpler Method

The simpler method described in the earlier post recognizes that creating a live sorted table connect to another table consists of the following steps:
  1. Ranking the rows in the table by the column to be sorted.
  2. Using the rank with the OFFSET() function to create the resulting table.
For Step (1), the method uses the built-in RANK() function provided by Excel. This introduces the limitations described above, because RANK() only works on numeric values and produces the same value for duplicates.

The key to fixing these problems is to replace the RANK() function with more general purpose functions.

Instead of RANK()

RANK() determines whether a value is the largest, second largest, third largest, or so on with respect to a list (or smallest, if we are going in the opposite order, which is determined by an optional third argument). One way to think of what it does is that it sorts the values in the list and determines the position of the original value.

An alternative but equivalent way of thinking about the calculation is that it tells us how many values are larger than (or smaller than) the given value. This alternative definition suggests other ways of arriving at the same rankings, such as:

....=COUNTIF(data!B$2:B$55, ">="&data!B2)

This formula can be placed alongside the original table (or anywhere else) and then copied down. It works by counting the number of values that are less than or equal to each value. The resulting ranking is from smallest value to largest value. To reverse the order, simply use "<=" instead. This solves one of the original problems, because the COUNTIF() function works with string data as well as numeric data.

An almost equivalent formulation is to use array functions.

....{=SUM(if(data!B$2:B$55>=data!B2, 1, 0)}

(If you are not familiar with array functions, check out Excel documentation or Data Analysis Using SQL and Excel.)

This is very similar to the COUNTIF() method, although the array functions have one advantage. The conditional logic can be more complicated, so we can do the ranking by multiple columns at the same time.

Using our own version of the rank function fixes two of the three problems. At this point, duplicates still get the same rank value.

Handling Duplicates

The problem with duplicate values is that all these methods assign the same ranking when two different rows have the same value. This makes it impossible to distinguish between the two rows, so one will be included in the sorted table multiple times.

The solution is to fix this problem by adding an offset. If the highest value is repeated multiple times, then all of those rows will have a ranking equal to the number of duplicates. In the following little table, the second column contains the rankings as calculated by either of the above two methods (RANK() does not work because the first column is not numeric):

a 3

a 3

a 3

b 5

b 5

What we want, though, is to have distinct values in the second column:

a 1

a 2

a 3

b 4

b 5

The solution is to subtract a value from the calculated ranking. This is the number of values that we have already seen that are equal to the value in question. Once again, this can be accomplished with either COUNTIF() or array functions:

....=COUNTIF(data!B$2:B$55, ">="&data!B2) + COUNTIF(data!B$2:B2, "="&data!B2)-1


....{=SUM(IF(data!B$2:B$55>=data!B2, 1, 0)) + SUM(IF(data!B$2:B2=data!B2, 1, 0))-1}

These formulations consist of two parts. The first part calculates a ranking, where duplicates get the same value. The second part subtracts the number of duplicates already seen in the list. For the simple example above, the results are actually:

a 3

a 2

a 1

b 5

b 4

This works just as well, although it does not preserve the original ordering.

Note that these formulas are all structured so they can be copied down cells and continue working.

What It All Looks Like Together

This method is perhaps best explained by seeing an example. The file sort-in-place.xls contains random information about the fifty states (latitude, longitude, population, and capital for example) on the "data" worksheet. The "data-sorted" worksheet shows the states abbreviations by rank order for each of the columns. For instance, for the size column Alaska is first, followedy by Texas, California, and Montana. For the population columns, the ordering is California, Texas, New York, and Florida. This worksheet using the rankings on the "ranking-countif()" worksheet.

The three worksheets called "ranking-" illustrate the three different methods of doing the rankings -- using RANK(), using COUNTIF(), and using array functions. Note that the RANK() method cannot handle text columns, so it does not work in this case.

If you like, you can change the data on the "data" tab and see the rankings change on the sorted tab. Voila! A sorted table connected by formulas to the original table!

Labels: ,

Monday, August 11, 2008

Sorting Cells in Excel Using Formulas

This post describes how to create a new table in Excel from an existing table, where the cells are sorted, and to do this using only formulas. As a result, modifying a value in the original table results in rearranging the sorted table. And, this is accomplished without macros and without using the "sort" menu option.

The material in this post is generalized in another post. Also, if you are interested in this post, you may be interested in my book Data Analysis Using SQL and Excel.

Consider a table with two columns:





The sorted table would be automatically calculated as:





If the first value were changed to 5, then the sorted table would automatically recalculate as:





There are two typical approaches to sorting cells in Excel. The first is to select a region and to sort it using menu options. This does not work when the cells are protected, part of a pivot table, and sometimes when they are calculated. This might also be a bad idea when the data is copied from another location or loaded by accessing a database.

A common alternative is to resort to writing a macro. However, Visual Basic macros are beyond the capabilities of even many experienced Excel users.

The approach described here is much simpler, since it only uses formulas. I should mention that the method described in this post only works for numeric data that has no duplicates. I will remedy that in the next post, where the ideas are extended both to data with duplicates and to character data.

Three Excel functions are the key to the idea: RANK(), MATCH(), and OFFSET(). The first function ranks numbers in a list. The second allows us to use this info to sort the list.

The following shows the effect of the RANK() function:






The function itself looks like:


15B=RANK(C5, $C$5:$C$8, 1)

14D=RANK(C6, $C$5:$C$8, 1)

13C=RANK(C7, $C$5:$C$8, 1)

11A=RANK(C8, $C$5:$C$8, 1)

The first argument is the value to be ranked. The cell C5 contains the value "15". The second argument is the range to use for h the ranking -- these are all the values in the cell. And the third is the direction, which means the lowest values get the lowest rankings. In this case, "15" is the largest value of four, so its rank is 4.

The following shows the formulas for the table:


1=OFFSET(C$4, MATCH(C11, $E$5:$E$8, 0), 0)=OFFSET(D$4, MATCH(C11, $E$5:$E$8, 0), 0)

2=OFFSET(C$4, MATCH(C12, $E$5:$E$8, 0), 0)=OFFSET(D$4, MATCH(C12, $E$5:$E$8, 0), 0)

3=OFFSET(C$4, MATCH(C13, $E$5:$E$8, 0), 0)=OFFSET(D$4, MATCH(C13, $E$5:$E$8, 0), 0)

4=OFFSET(C$4, MATCH(C14, $E$5:$E$8, 0), 0)=OFFSET(D$4, MATCH(C14, $E$5:$E$8, 0), 0)

The first column is the desired ranking column. This is simply the numbers starting at 1 and incrementing by 1. The function MATCH(C11, $E$5:$E$8, 0) simply looks up the ranks in the column of calculated ranks. So, the value in C11 is "1". In the previous column, this is the fourth value. The OFFSET() function then finds the fourth value in the C column for the value and the fourth value in the D column for the label.

The result is that the sorted table is tied to the original table by formulas, so changing values in the original table will result in recalculating the sorted table, automatically.

The overall approach is simple to describe. First, we need to calculate the ranking of each row in the original table based on the column that we want to sort. This ranking takes on the values 1 to N fo rthe values in the table. Then, we create a new sorted table that has the rankings in order. This table looks up the appropriate row for each ranking using the MATCH() function. Finally, the OFFSET() function is used to lookup the appropriate values from the appropriate row. The result is a table that is sorted with a "live" connection to another table.

Labels: ,