| [[Image:Spurious Correlation (Greyscale).svg|thumb|300px|right|An illustration of spurious correlation, this figure shows 500 observations of x/z plotted against y/z. The sample correlation is 0.53, even though x, y, and z are statistically independent of each other (i.e., the pairwise correlations between each of them are zero).]]
| [[Image:Spurious Correlation (Colour).svg|thumb|300px|right|This figure shows the 500 observations of y/z plotted against x/z from above, this time with the z-values on a colour scale to highlight how dividing through by z induces spurious correlation.]]
| |
| | |
| '''Spurious correlation''' is a term coined by [[Karl Pearson]] to describe the [[correlation]] between ''ratios'' of absolute measurements that arises as a consequence of using ratios, rather than because of any actual correlations between the measurements.<ref name="Pearson">{{cite journal|last=Pearson|first=Karl|title=Mathematical Contributions to the Theory of Evolution—On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs|journal=Proceedings of the Royal Society of London|year=1897|volume=60|pages=489–498|url=http://www.jstor.org/stable/115879}}</ref>
| |
| | |
| The phenomenon of spurious correlation is one of the main motivations for the field of [[Compositional data|compositional data analysis]], which deals with the analysis of variables that carry only ''relative'' information, such as proportions, percentages and parts-per-million.<ref name="Aitchison">{{cite book|last=Aitchison|first=John|title=The statistical analysis of compositional data|year=1986|publisher=Chapman & Hall|isbn=0-412-28060-4}}</ref><ref>{{cite book|title=Compositional Data Analysis: Theory and Applications|year=2011|publisher=Wiley|isbn=9780470711354|url=http://onlinelibrary.wiley.com/book/10.1002/9781119976462|editor1-first=Vera|editor1-last=Pawlowsky-Glahn|editor2-first=Antonella |editor2-last=Buccianti}}</ref>
| |
| | |
| Pearson's definition of spurious correlation is distinct from (and should not be confused with) misconceptions about [[Correlation and dependence#Correlation and causality|correlation and causality]], or the term [[spurious relationship]].
| |
| | |
| ==Illustration of spurious correlation==
| |
| | |
| Pearson gives a simple example of spurious correlation:<ref name="Pearson" />
| |
| {{Quote|Select three numbers within certain ranges at random, say ''x'', ''y'', ''z'', these will be pair and pair uncorrelated. Form the proper fractions ''x''/''y'' and ''z''/''y'' for each triplet, and correlation will be found between these indices.}}
| |
| | |
| The upper scatter plot on the right illustrates this example using 500 observations of ''x'', ''y'', and ''z''. Variables ''x'', ''y'' and ''z'' are drawn from normal distributions with means 10, 10 and 30 and standard deviations 1, 1 and 3, respectively, so that all three have the same coefficient of variation (0.1), i.e.,
| |
| | |
| : <math>\begin{align}
| |
| x,y & \sim N(10,1) \\
| |
| z & \sim N(30,3) \\
| |
| \end{align}</math>
| |
| | |
| Even though ''x'', ''y'', and ''z'' are statistically independent (and therefore pairwise uncorrelated), the ratios ''x''/''z'' and ''y''/''z'' have a sample correlation of 0.53. This arises from the common divisor (''z'') and is easier to see when the points in the scatter plot are coloured by their ''z''-value, as in the lower plot. Trios of (''x'', ''y'', ''z'') with relatively large ''z'' values tend to appear in the bottom left of the plot; trios with relatively small ''z'' values tend to appear in the top right.
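The effect described above can be reproduced with a short simulation. The following is a minimal Python sketch, not the code used to produce the figures: the function names, the seed, and the standard deviations (1, 1 and 3, chosen here so that all three variables have a coefficient of variation of 0.1) are assumptions made for illustration.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(n=500, seed=0):
    """Draw independent x, y, z and return (corr(x, y), corr(x/z, y/z))."""
    rng = random.Random(seed)
    # Assumed parameters: means 10, 10, 30 and standard deviations 1, 1, 3.
    x = [rng.gauss(10, 1) for _ in range(n)]
    y = [rng.gauss(10, 1) for _ in range(n)]
    z = [rng.gauss(30, 3) for _ in range(n)]
    # Dividing both variables by the same z is what induces the correlation.
    x_over_z = [xi / zi for xi, zi in zip(x, z)]
    y_over_z = [yi / zi for yi, zi in zip(y, z)]
    return pearson(x, y), pearson(x_over_z, y_over_z)


if __name__ == "__main__":
    raw, ratio = simulate()
    print(f"corr(x, y)     = {raw:+.3f}")
    print(f"corr(x/z, y/z) = {ratio:+.3f}")
```

Under these assumptions the correlation between ''x'' and ''y'' should stay close to zero, while the correlation between the two ratios should settle near one half as the sample size grows; the exact values depend on the seed.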
| |
| | |
| ==Approximate amount of spurious correlation==
| |
| Pearson derived an approximation of the correlation that would be observed between two indices (<math>x_1/x_3</math> and <math>x_2/x_4</math>), i.e., ratios of the absolute measurements <math>x_1, x_2, x_3, x_4</math>:
| |
| | |
| : <math>
| |
| \rho = \frac{r_{12} v_1 v_2 - r_{14} v_1 v_4 - r_{23} v_2 v_3 + r_{34} v_3 v_4}{\sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3} \sqrt{v_2^2 + v_4^2 - 2 r_{24} v_2 v_4}}
| |
| </math>
| |
| | |
| where <math>v_i</math> is the [[coefficient of variation]] of <math>x_i</math>, and <math>r_{ij}</math> the [[Correlation#Pearson.27s product-moment coefficient|Pearson correlation]] between <math>x_i</math> and <math>x_j</math>.
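Pearson's approximation can be evaluated directly. The sketch below is one possible Python transcription; the function name `ratio_correlation` and its keyword arguments are illustrative assumptions rather than anything defined by Pearson. The last numerator term is written as r34·v3·v4, which is the form under which the expression reduces to the common-divisor result when the two divisors coincide.

```python
import math


def ratio_correlation(v1, v2, v3, v4,
                      r12=0.0, r13=0.0, r14=0.0,
                      r23=0.0, r24=0.0, r34=0.0):
    """Approximate correlation between the indices x1/x3 and x2/x4.

    v1..v4 are coefficients of variation of x1..x4; rij is the Pearson
    correlation between xi and xj (defaulting to 0, i.e. uncorrelated).
    """
    numerator = (r12 * v1 * v2 - r14 * v1 * v4
                 - r23 * v2 * v3 + r34 * v3 * v4)
    denominator = (math.sqrt(v1 ** 2 + v3 ** 2 - 2 * r13 * v1 * v3)
                   * math.sqrt(v2 ** 2 + v4 ** 2 - 2 * r24 * v2 * v4))
    return numerator / denominator


if __name__ == "__main__":
    # Four mutually uncorrelated measurements: no spurious correlation.
    print(ratio_correlation(0.1, 0.1, 0.1, 0.1))
    # Common divisor (x3 and x4 are the same variable, so r34 = 1).
    print(ratio_correlation(0.1, 0.1, 0.1, 0.1, r34=1.0))
```

Keeping every correlation as a keyword argument with a default of zero makes the uncorrelated case the baseline, so any non-zero result in that setting can be attributed to a shared divisor.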
| |
| | |
| This expression can be simplified for the case of a common divisor by setting <math>x_3=x_4</math> and assuming that <math>x_1, x_2, x_3</math> are mutually uncorrelated, giving the spurious correlation:
| |
| | |
| : <math>
| |
| \rho_0 = \frac{v_3^2}{\sqrt{v_1^2 + v_3^2} \sqrt{v_2^2 + v_3^2}}.
| |
| </math>
| |
| | |
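| For the special case in which all coefficients of variation are equal to a common value <math>v</math> (as is the case in the illustrations at right), the numerator is <math>v^2</math> and the denominator is <math>2v^2</math>, so <math>\rho_0 = 0.5</math>.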
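To see how the size of the spurious correlation depends on the variability of the divisor, the common-divisor expression can be tabulated for a few values. This is a hedged Python sketch: the function name `common_divisor_correlation` and the example coefficients of variation are assumptions chosen for illustration.

```python
import math


def common_divisor_correlation(v1, v2, v3):
    """Spurious correlation between x1/x3 and x2/x3 for uncorrelated x1, x2, x3.

    v1, v2, v3 are the coefficients of variation of x1, x2 and the divisor x3.
    """
    return v3 ** 2 / (math.sqrt(v1 ** 2 + v3 ** 2)
                      * math.sqrt(v2 ** 2 + v3 ** 2))


if __name__ == "__main__":
    # Hold the numerators' variability fixed and vary the divisor's.
    for v3 in (0.01, 0.05, 0.1, 0.2, 0.5):
        rho0 = common_divisor_correlation(0.1, 0.1, v3)
        print(f"v3 = {v3:4.2f}  rho_0 = {rho0:.3f}")
```

The more variable the divisor is relative to the numerators, the closer the spurious correlation gets to 1; a nearly constant divisor induces almost none.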
| |
| | |
| ==Relevance to biology and other sciences==
| |
| Pearson was joined by [[Galton|Sir Francis Galton]] and [[Walter Frank Raphael Weldon]] in cautioning scientists to be wary of spurious correlation, especially in biology where it is common to scale or [[Normalization (statistics)|normalize]] measurements by dividing them by a particular variable or total. The danger he saw was that conclusions would be drawn from correlations that are artifacts of the analysis method, rather than actual “organic” relationships.
| |
| | |
| However, it would appear that spurious correlation (and its potential to mislead) is not yet widely understood. In 1986 [[John Aitchison]], who pioneered the log-ratio approach to [[Compositional data|compositional data analysis]], wrote:<ref name="Aitchison" />
| |
| {{Quote|It seems surprising that the warnings of three such eminent statistician-scientists as Pearson, Galton and Weldon should have largely gone unheeded for so long: even today uncritical applications of inappropriate statistical methods to compositional data with consequent dubious inferences are regularly reported.}}
| |
| More recent publications suggest that this lack of awareness prevails, at least in molecular bioscience.<ref>{{cite book|first1=David|last1=Lovell|first2=Warren|last2=Müller|first3=Jen|last3=Taylor|first4=Alec|last4=Zwart|first5=Chris|last5=Helliwell|chapter=Chapter 14: Proportions, Percentages, PPM: Do the Molecular Biosciences Treat Compositional Data Right?|title=Compositional Data Analysis: Theory and Applications|year=2011|publisher=Wiley|isbn=9780470711354|url=http://onlinelibrary.wiley.com/book/10.1002/9781119976462|editor1-first=Vera|editor1-last=Pawlowsky-Glahn|editor2-first=Antonella |editor2-last=Buccianti}}</ref>
| |
| | |
| ==References==
| |
| {{reflist}}
| |
| | |
| | |
| | |
| [[Category:Covariance and correlation]]
| |