Monday, November 14, 2005

Lies, damned lies, and Excel formulae

I made a discovery the other day. Not one of those "eureka" moments beloved of bathing Greeks, but something that prompted me to wonder about the accuracy of some of the figures we take for granted. We are so used to Excel on every desktop that we trust it implictly, and so when I needed to work out the standard deviation of some figures, I naturally turned to Excel. For those of you whose maths is rusty, standard deviation is how spread out a sample of numbers are. For example: the average of 1,3,5,7,9 is 5, and so is 3,4,5,6,7 but it can be seen that the latter sequence has its numbers more closely bunched. Standard deviation is just a mathematical measure of how close or otherwise that bunching is (in the examples above the standard deviation of the first set is 2.83 and the second set 1.41 i.e. the second set is closer bunched than the first set).

In Excel to use a function you just type into a cell something like "=average(1,3,5,7,9)" and magically you get the answer (5 in this case). So, what could be easier to do than type in:
"=stdev(1,3,5,7,9)" and see the answer appear? The trouble is that it doesn't. The answer pops up as 3.16, not 2.83 as I was expecting. Just in case you doubt my ability to calculate a standard deviation, feel free to do it the old-fashioned way by hand, and you will see that you get 2.83, not 3.16, so what is going on? After digging around the Excel help and chatting to a mathematician friend to check I was not going completely mad, I discovered that there are actually two different standard deviation functions in Excel, one designed for where you want the whole sample set measured, and one where you want to estimate a large population from a sample, which has a slightly different formula. Now I may be getting a bit slow these days, but I did do a maths degree and yet this distinction had eluded me all these years, so I doubt I'm the only person out there unaware of this difference. If you were the person at Microsoft naming Excel functions, which do you think that people would think was the "normal" version, "STDEV" or "STDEVP", which is what they actually named the function that calculates the standard deviation of a whole population. I am guessing that not too many of us go "aha, I'll try "=STDEVP I expect that will be it".

Now this may seem like a lot of fuss about an esoteric mathematical function, but be aware that standard deviation is one of the most commonly used statistical functions, used to look at samples of population, mechanical failure rates, delivery errors, temperatures, patient response rates, you name it. People take serious decisions based on statistics: which drug to put forward for clinical trials, traffic planning, machine maintenance and endless others; standard deviation is the most commonly used tool in "statistical process control", widespread in the manufacturing industry. Given that the most of the modern world uses Excel, I find it pretty surprising that a sizeable proportion of the world has been using the wrong standard deviation function for the last twenty years all because some idiot in Seattle chose a "precise" name rather than the obvious name that most of us would have chosen.

I suppose this is only what I should have expected from a product who thinks that 1900 is a leap year. Try the formula "=DATE(1900,2,29)" and watch it happily display the 29th February 1900. As we should all be aware after the fuss over Y2k, 1900 is NOT a leap year (leap years are every four years, except centuries which are not, expect every fourth century, which is, so 1600 and 2000 are leap years, but not 1800 or 1800 or 1900). The moral of this little story: don't take everything on trust!

Brian said...

Just to let you know, I recently finished an undergrad statistics course. During the course, we were taught two different standard deviation formulas - one for an exact or small population and one for estimating a large population. Unfortunately, I don't have my book with me, but the calculator that we used also differentiated between the two formulas.

5:16 AM
Andy Hayler said...

Thanks very much indeed for your comment. I now understand this distinction, but am amazed that Microsoft chose to name the “sample” version STDDEV and the more obvious one (that appears in every school text book) STDDEVP, as surely a lot of people are just going to assume that STDDEV is the “main” one? Rather than the other way around.

5:49 AM
Anonymous said...

This comment has been removed by a blog administrator.

11:01 AM