A practical understanding of probability and statistics at an advanced (at least college) level is increasingly important in the modern world. For example, many expensive and potentially hazardous drugs, including chemotherapy for cancer and anti-cholesterol drugs such as Lipitor, are approved for use and justified to patients based on complex statistical studies. Children are increasingly medicated for a range of alleged psychiatric disorders such as Attention Deficit Hyperactivity Disorder (ADHD or ADD), bipolar disorder, and others. Many questions have arisen about the seeming epidemic of autism (see the recent article The Mathematics of Autism).
Important public policy issues such as “global warming” hinge on complex mathematical models and statistics. The public is often swayed by shocking, widely repeated statistics such as “the Soviet Union is producing two to three times as many engineers and scientists as the United States” (1950s), “one million missing children” (1980s), and “a new drug costs $800 million to research and develop.”
Complex mathematical and statistical models for mortgage-backed securities played a major role in the housing bubble and the financial crash of 2008. The financial system continues to rely on these so-called derivative securities despite numerous costly failures.
On the positive side, free open source tools with powerful statistical capabilities are widely available such as GNU Octave and the R programming language. More and more data is available in accessible formats such as comma separated values (CSV) files, tab-delimited files, or Excel spreadsheets. LibreOffice is a free open source program that can read most Excel spreadsheet formats. More information on probability and statistics is available at Wikipedia and other online sources. The National Institutes of Health, to its credit, is, for now, attempting to make research data and papers funded by the NIH openly available. Many other research programs seem to be trying to do this. Hopefully these trends will continue.
Formal college-level education in probability and statistics tends to focus on idealized situations: flipping a fair coin, games of chance at a fair casino, and highly idealized laboratory experiments in “hard sciences” such as physics. These idealizations lack many of the actual difficulties encountered in frontier research or in real-world data from “softer” fields such as economics, finance, medicine, biology, psychiatry, marketing, and so on. In many real-world situations, the major problems encountered, including how the data is collected and how the numbers are defined, differ from typical textbook accounts of probability and statistics.
This article discusses the pitfalls and gotchas of probability and statistics in practice.
Averages, Medians, and Distributions
Averages can be highly misleading. For example, these two sequences of ten numbers have the same average value — ten (10):
octave-3.2.4.exe:10> a = [10 10 10 10 10 10 10 10 10 10];
octave-3.2.4.exe:11> mean(a)
ans = 10
octave-3.2.4.exe:12> median(a)
ans = 10
octave-3.2.4.exe:13> b = [1 1 1 1 1 1 1 1 1 91];
octave-3.2.4.exe:14> mean(b)
ans = 10
octave-3.2.4.exe:15> median(b)
ans = 1
The average, or arithmetic mean, is the sum of all the numbers in the sequence divided by the number of values. The median is the middle value when the numbers are sorted in increasing order (or the average of the two middle values when the sequence has an even number of elements): as many elements of the sequence lie at or above the median as lie at or below it.
The median is an example of a robust statistic that is less susceptible to misleading outliers in the data. It is often better to look at the median instead of the average, especially with noisy real-world data.
The median can also be misleading. These two sequences have the same median — ten (10) — but are quite different.
octave-3.2.4.exe:10> a = [10 10 10 10 10 10 10 10 10 10];
octave-3.2.4.exe:11> mean(a)
ans = 10
octave-3.2.4.exe:12> median(a)
ans = 10
octave-3.2.4.exe:21> b = [0 0 0 0 10 10 100 100 100 100];
octave-3.2.4.exe:22> median(b)
ans = 10
In the first case, the median value ten is highly representative of the typical value in the sequence. In the second case, the spread of values is very high and the median is misleading: the typical values are zero and one hundred.
Any single statistic such as the average, median, or mode (most common value in the data) can be misleading depending on the underlying distribution of the sequence and the context in which the statistic is used.
No matter how convincing a statistic may seem, it is best to examine the distribution of the underlying data.
Outliers and the Bell Curve
The Gaussian, also known as the Normal Distribution or Bell Curve, is very heavily used, often improperly, in statistics.
The Gaussian is taught in almost all introductory probability and statistics courses, at least at the college level. There is a theorem, known as the Central Limit Theorem, which states that the distribution of the average of a sequence of independent identically distributed (IID) random variables with finite variance converges to the Gaussian distribution as the number of variables in the sequence (N) tends to infinity.
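The convergence the Central Limit Theorem describes is easy to see numerically. The article's transcripts use GNU Octave, but the same check is a few lines of Python: average batches of uniform random numbers and the averages cluster in a Gaussian-like bell around the true mean, with a spread shrinking like 1/sqrt(N).

```python
import random
import statistics

random.seed(42)

def sample_mean(n):
    # Average of n IID uniform(0, 1) variables; each has mean 0.5
    # and variance 1/12.
    return sum(random.random() for _ in range(n)) / n

# Draw many such averages; by the Central Limit Theorem their
# distribution approaches a Gaussian centered on 0.5, with
# standard deviation sqrt(1/12)/sqrt(100) ~ 0.0289.
means = [sample_mean(100) for _ in range(10000)]
print(statistics.mean(means))   # close to 0.5
print(statistics.stdev(means))  # close to 0.0289
```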
[Figure: data generated according to the Gaussian/Normal Distribution/Bell Curve with mean 0.0 and standard deviation 1.0.]
The Gaussian/Normal/Bell Curve is very heavily used in mathematical models today. However, despite the Central Limit Theorem, many real-world distributions are not Gaussian and have long tails. The data often contains outliers.
Several mathematical models used in quantitative finance such as the famous Black-Scholes Option Pricing Model use the Gaussian distribution. They often assume the returns for a financial asset are distributed according to a Gaussian distribution. Historical data shows that the returns for many financial assets do not have a Gaussian/Normal/Bell Curve distribution and often contain extreme “fat tail” outliers such as market crashes. Mathematical models using a Gaussian distribution tend to underestimate the risks of financial assets.
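A quick simulation illustrates the fat-tail problem. The Python sketch below uses a Student-t distribution as a stand-in for heavy-tailed returns (the choice of three degrees of freedom is just for illustration) and counts 4-sigma moves in Gaussian data versus heavy-tailed data rescaled to the same variance:

```python
import math
import random

random.seed(0)

def student_t(df):
    # Heavy-tailed Student-t variate built from standard library calls:
    # a standard Gaussian divided by sqrt(chi-squared / df).
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

N = 100000
df = 3
scale = math.sqrt(df / (df - 2))  # rescale t(3) to unit variance for a fair comparison
gauss_extremes = sum(abs(random.gauss(0.0, 1.0)) > 4.0 for _ in range(N))
t_extremes = sum(abs(student_t(df) / scale) > 4.0 for _ in range(N))
# The Gaussian series produces only a handful of 4-sigma moves;
# the heavy-tailed series produces vastly more.
print(gauss_extremes, t_extremes)
```

A Gaussian model fitted to such heavy-tailed data would dramatically underestimate how often "market crash" sized moves occur.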
Statistical Significance

Statistical significance can be a treacherous concept. Statistical significance is often reported as something known as a p value. The p value is the probability that a result at least as extreme as the one observed would occur by pure chance if there were no real effect. The lower the p value, the greater the statistical significance of a result.
Consider flipping a coin. The probability that five heads will appear by chance in a row is:

(1/2)^5 = 1/32

or 3.125 percent (0.03125).
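The arithmetic is easy to check, both exactly and by simulation, as in this short Python sketch:

```python
from fractions import Fraction
import random

# Exact probability of five heads in a row with a fair coin.
p = Fraction(1, 2) ** 5
print(p, float(p))  # 1/32 = 0.03125

# Simulation: flip five coins many times and count all-heads runs.
random.seed(1)
trials = 100000
hits = sum(all(random.random() < 0.5 for _ in range(5)) for _ in range(trials))
print(hits / trials)  # close to 0.03125
```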
This is less than five percent. Many scientific journals accept papers that report a p value of five percent or less for their results. The p value is often interpreted as the probability that the hypothesis being tested is correct, but that interpretation is wrong: the p value says nothing directly about the truth of the hypothesis.
Keep in mind that people flip coins and get five heads (or five tails) in a row all the time. With a significance threshold of five percent, one in twenty experiments testing an effect that does not actually exist will nonetheless report a “statistically significant” result purely by chance.
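This one-in-twenty rate can be demonstrated directly. The Python sketch below runs thousands of experiments in which the effect being tested is, by construction, nonexistent, and counts how many nonetheless reach p ≤ 0.05:

```python
import math
import random
import statistics

random.seed(2)

def p_value_null_experiment(n=30):
    # One experiment where the null hypothesis is exactly true:
    # n samples from N(0, 1), two-sided z-test of "the mean is zero".
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = statistics.mean(xs) * math.sqrt(n)      # known sd = 1
    return math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p value

experiments = 10000
false_positives = sum(p_value_null_experiment() <= 0.05 for _ in range(experiments))
print(false_positives / experiments)  # close to 0.05: one in twenty
```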
Would you live in a house that had a five percent chance of collapsing on you? Drive over a bridge that had a five percent chance of collapsing as you cross the bridge? Probably not. Even though ninety-five percent seems high and is typically an A in classroom homework, it is not a very high level of confidence in the real world.
The p value also tells you nothing about whether the “statistically significant” effect was due to the hypothesis being tested or to the cause suggested by the authors of a scientific paper or study. Quite a number of studies in parapsychology (ESP, etc.) have produced impressive levels of statistical significance. Is this due to the hypothesized paranormal cause, sophisticated cheating, or some other unknown cause? “Something else” is very difficult to rule out.
Statistical significance is not the same as the strength of an effect. For example, drug A might have an effect of 1.0 on some scale whereas drug B has an effect of 1.0000001, a negligible improvement in practice, but the statistical significance of this result could be extremely high. The p value could be one in a trillion. One may be very confident of a tiny, unimportant difference.
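Confidence and importance are separate questions, as a back-of-the-envelope calculation shows. In the Python sketch below (a two-sample z-test with a known standard deviation; the sample size of 10^16 is deliberately absurd to make the point), a difference of 0.0000001 on the measurement scale achieves an astronomically small p value once the sample is large enough:

```python
import math

def two_sided_p(observed_diff, sd, n):
    # Two-sample z-test with known standard deviation sd and n
    # subjects per arm: p value for the observed difference in means.
    z = observed_diff / (sd * math.sqrt(2.0 / n))
    return math.erfc(abs(z) / math.sqrt(2.0))

# With an ordinary sample, a 0.0000001 difference is invisible...
print(two_sided_p(1e-7, sd=1.0, n=100))     # essentially 1: no significance
# ...but with an enormous sample it becomes "highly significant"
# even though it remains practically meaningless.
print(two_sided_p(1e-7, sd=1.0, n=10**16))  # astronomically small p value
```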
In some fields such as experimental particle physics, there is skepticism about the interpretation of the p value or equivalent measures of statistical significance. This is because many results that have been reported with very low p values nonetheless could not be replicated. In some cases, such as the pentaquark, several different research groups reported the same or a similar effect which ultimately “went away.”
Probability and statistics say little about systematic errors. The OPERA experiment’s spurious report of faster-than-light neutrinos was due to a systematic error in measuring very tiny time delays. The result was statistically significant but incorrect for other reasons.
Correlation and Causation
Correlation does not prove causation. There are many statistical methods and single statistics (numbers) that measure whether two or more measurements are correlated. Even if A and B are perfectly correlated, this can mean that A causes B, that B causes A, that A and B share a common cause, or even that the correlation arose through certain kinds of chance occurrences.
Common Correlation Coefficients in GNU Octave
octave-3.2.4.exe:8> data = randn(1, 100);
octave-3.2.4.exe:9> data2 = 2.0*data;
octave-3.2.4.exe:10> corrcoef(data, data2)
ans = 1.0000
octave-3.2.4.exe:11> data3 = randn(1,100);
octave-3.2.4.exe:12> corrcoef(data, data3)
ans = -0.080590
octave-3.2.4.exe:13> kendall(data, data2)
ans = 1
octave-3.2.4.exe:14> spearman(data, data2)
ans = 1
octave-3.2.4.exe:15> kendall(data, data3)
ans = -0.028283
octave-3.2.4.exe:16> spearman(data,data3)
ans = -0.049889
In the GNU Octave code above, randn generates random data with the normal distribution with mean 0.0 and standard deviation 1.0. data and data2 are perfectly correlated since data2 is exactly two times data. data and data3 are uncorrelated. The function corrcoef computes Pearson’s correlation coefficient, the most commonly used correlation coefficient. Frequently, this is what is used to say two data sets are correlated. The functions kendall and spearman implement other, less commonly used correlation coefficients.
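For readers without Octave, the same demonstration fits in a few lines of Python. The pearson function below implements the standard formula for Pearson's correlation coefficient directly:

```python
import random
import statistics

random.seed(3)

def pearson(xs, ys):
    # Pearson's correlation coefficient: covariance divided by the
    # product of the standard deviations (as Octave's corrcoef computes).
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

data = [random.gauss(0.0, 1.0) for _ in range(100)]
data2 = [2.0 * x for x in data]                        # deterministic function of data
data3 = [random.gauss(0.0, 1.0) for _ in range(100)]   # independent of data

print(pearson(data, data2))  # exactly 1: perfectly correlated
print(pearson(data, data3))  # near 0: uncorrelated
```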
Even though most scientists, mathematicians, and statisticians are taught that correlation does not prove causation, it is common to find this disregarded in practice, especially in biology and medicine. Many prominent theories in biology and medicine are based, on close examination, on a correlation, perhaps a very strong correlation, but only a correlation.
Beware of the use of language such as “the link between A and B” or “the relationship between A and B” used as if “link” or “relationship” means A causes B (or B causes A). Link and relationship are very general terms. If A and B are correlated, one can honestly say there is a “link” or “relationship” between A and B, even though causation is not actually proven by a correlation.
Categories and Definitions
By far the greatest and most common problem with using probability and statistics in the real world lies in the definition of terms, categories, and measured values. When counting the number of engineers produced by the United States, the Soviet Union in the 1950’s, China, or other nations, what is an engineer? What is a missing child in “one million missing children?” What does it mean to say someone has been cured of cancer or has survived cancer? What is autism?
An engineer can be: someone with a B.S. in an engineering discipline, someone licensed to practice as an “engineer” by a government body, a Ph.D. in an engineering discipline, an A.A. in an engineering discipline, a technician with a high school diploma or GED, an enthusiast with an 8th grade education like Orville and Wilbur Wright, a civil engineer, an electrical engineer, a “software engineer,” a computer programmer, a medical technician, a nurse, an agricultural technician and so on.
In the 1950’s and 1960’s, Soviet expert Nicholas DeWitt used a broad definition of scientists and engineers to argue that the Soviet Union produced two to three times as many scientists/engineers as the United States, by amongst other things including engineers receiving correspondence degrees, medical workers including nurses, and agricultural workers in his total (see MIT Historian David Kaiser’s article The Physics of Spin: Sputnik Politics and American Physicists in the 1950s).
A missing child can be a teenager who runs away from home after an argument for a few hours. A missing child can be a child who leaves voluntarily, but illegally, with a non-custodial parent. A missing child can be a child abducted by a non-custodial parent. A missing child can be a long term runaway or “throwaway.” A missing child can be a child abducted and killed by a psychopath. In the 1980’s, even to the present day occasionally, the statistic “one million missing children” (even larger numbers were sometimes cited) was used to imply the latter. Fortunately, most reported missing children cases involve short term runaways or parental custody cases, certainly cause for concern in some cases but not an epidemic of homicide or stranger abductions.
In the medical literature, being “cured” of cancer or “surviving” cancer often means living for at least five years after being diagnosed with the disease. This differs dramatically from common English usage of the words “cured” and “survive.” Since cancer is often a slow progressing disease — many people with untreated cancer will live at least five years — this practice is particularly misleading.
The statistics on the prevalence of autism from the United States Centers for Disease Control (CDC) are extremely difficult to interpret due to the vague and broad definition of “autism spectrum disorders,” a situation the CDC has done little to resolve despite many years and billions of dollars in funding for autism research.
These definitional issues are rarely discussed, and then only briefly, in introductory college-level textbooks on probability and statistics. These textbooks deal with very clean, well defined situations such as flipping an idealized perfectly fair coin. Heads is well defined and unambiguous. Tails is equally well defined and unambiguous. There is no question that the coin has an equal chance of coming up heads or tails. There is no cheating.
In public policy debates, scientific controversies, and other real-world applications of probability and statistics, issues about how the data were collected, how the terms and values are measured and defined, and what the categories used actually mean often take center stage and are the subject of both bitter controversy and simple confusion. It often requires extensive research to resolve these issues; often they are not resolved, certainly not to the satisfaction of all.
A good understanding of probability and statistics is increasingly necessary in the modern world. There are many ways to misuse probability and statistics, both intentionally and by accident. One should almost never take a statistic at face value, especially when powerful vested interests are at stake. The best course of action is to examine the data and the analysis of the data carefully. Unfortunately, this is often time consuming, but for important issues there is no substitute.
© 2012 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing video compression and speech recognition technologies. He has extensive experience developing software in C, C++, Visual Basic, Mathematica, MATLAB, and many other programming languages. He is probably best known for his AVI Overview, an Internet FAQ (Frequently Asked Questions) on the Microsoft AVI (Audio Video Interleave) file format. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at email@example.com.
Suggested Reading

Using Murder: The Social Construction of Serial Homicide
By Philip Jenkins
This book about a depressing topic is somewhat pedantic but has some good discussions of the use and misuse of crime statistics for serial killers and murders in the 1980s.
The $800 Million Pill: The Truth behind the Cost of New Drugs
By Merrill Goozner
A critical look at the claim, made by pharmaceutical companies, that drugs cost an average of $800 million to research and develop.
Toil, Trouble, and the Cold War Bubble: Physics and the Academy since World War II
David Kaiser’s Presentation at the Perimeter Institute on the Cold War Physics Bubble
Includes a detailed discussion of how Nicholas DeWitt’s scientist and engineer production numbers were used and abused during the Cold War.
When Genius Failed: The Rise and Fall of Long-Term Capital Management
By Roger Lowenstein
A dry run for the current financial crisis with a good, non-technical discussion of the fat tails problem in quantitative finance.