In [[statistics]], the '''Kendall rank correlation coefficient''', commonly referred to as '''Kendall's tau (τ) coefficient''', is a [[statistic]] used to measure the [[Association (statistics)|association]] between two measured quantities. A '''tau test''' is a [[non-parametric statistics|non-parametric]] [[hypothesis test]] for statistical dependence based on the tau coefficient.

Specifically, it is a measure of [[rank correlation]], i.e., the similarity of the orderings of the data when [[ranked]] by each of the quantities. It is named after [[Maurice Kendall]], who developed it in 1938,<ref>{{cite journal
 |last=Kendall |first=M.
 |year=1938
 |title=A New Measure of Rank Correlation
 |journal=[[Biometrika]]
 |volume=30 |issue=1–2 |pages=81–89
 |doi=10.1093/biomet/30.1-2.81
 |jstor=2332226}}</ref> though [[Gustav Fechner]] had proposed a similar measure in the context of [[time series]] in 1897.<ref>{{cite journal
 |last=Kruskal |first=W.H. |authorlink=William Kruskal
 |year=1958
 |title=Ordinal Measures of Association
 |journal=[[Journal of the American Statistical Association]]
 |volume=53 |issue=284 |pages=814–861
 |mr=100941 |jstor=2281954
 |doi=10.2307/2281954
}}</ref>
==Definition==
Let (''x''<sub>1</sub>, ''y''<sub>1</sub>), (''x''<sub>2</sub>, ''y''<sub>2</sub>), …, (''x''<sub>''n''</sub>, ''y''<sub>''n''</sub>) be a set of observations of the joint random variables ''X'' and ''Y'', such that all the values of (''x''<sub>''i''</sub>) and (''y''<sub>''i''</sub>) are unique. Any pair of observations (''x''<sub>''i''</sub>, ''y''<sub>''i''</sub>) and (''x''<sub>''j''</sub>, ''y''<sub>''j''</sub>) is said to be ''concordant'' if the ranks for both elements agree: that is, if both ''x''<sub>''i''</sub> > ''x''<sub>''j''</sub> and ''y''<sub>''i''</sub> > ''y''<sub>''j''</sub>, or if both ''x''<sub>''i''</sub> < ''x''<sub>''j''</sub> and ''y''<sub>''i''</sub> < ''y''<sub>''j''</sub>. It is said to be ''discordant'' if ''x''<sub>''i''</sub> > ''x''<sub>''j''</sub> and ''y''<sub>''i''</sub> < ''y''<sub>''j''</sub>, or if ''x''<sub>''i''</sub> < ''x''<sub>''j''</sub> and ''y''<sub>''i''</sub> > ''y''<sub>''j''</sub>. If ''x''<sub>''i''</sub> = ''x''<sub>''j''</sub> or ''y''<sub>''i''</sub> = ''y''<sub>''j''</sub>, the pair is neither concordant nor discordant.

The Kendall τ coefficient is defined as:

: <math>\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n (n-1) } .</math><ref>{{SpringerEOM |title=Kendall tau metric |id=K/k130020 |first=R.B. |last=Nelsen}}</ref>
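
To make the definition concrete, the following Python sketch (illustrative only; the function name and example data are not from any library) enumerates every pair of observations, counts the concordant and discordant ones, and computes τ for data without ties:

 from itertools import combinations
 def kendall_tau(x, y):
     # naive O(n^2) computation of tau, assuming no tied values in x or y
     n = len(x)
     concordant = discordant = 0
     for i, j in combinations(range(n), 2):
         if (x[i] - x[j]) * (y[i] - y[j]) > 0:
             concordant += 1   # the two orderings agree on this pair
         else:
             discordant += 1   # the two orderings disagree on this pair
     return (concordant - discordant) / (n * (n - 1) / 2)
 print(kendall_tau([1, 2, 3, 4, 5], [3, 4, 1, 2, 5]))   # 0.2: 6 concordant vs. 4 discordant pairs out of 10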
===Properties===
The [[denominator]] is the total number of pair combinations, so the coefficient must be in the range −1 ≤ ''τ'' ≤ 1.
* If the agreement between the two rankings is perfect (i.e., the two rankings are the same), the coefficient has value 1.
* If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other), the coefficient has value −1.
* If ''X'' and ''Y'' are [[Independence (probability theory)#Independent random variables|independent]], then we would expect the coefficient to be approximately zero.
==Hypothesis test==
The Kendall rank coefficient is often used as a [[test statistic]] in a [[statistical hypothesis testing|statistical hypothesis test]] to establish whether two variables may be regarded as statistically dependent. This test is [[non-parametric statistics|non-parametric]], as it does not rely on any assumptions on the distributions of ''X'' or ''Y'' or the distribution of (''X'',''Y'').

Under the [[null hypothesis]] of independence of ''X'' and ''Y'', the [[sampling distribution]] of ''τ'' has an [[expected value]] of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the [[normal distribution]], with mean zero and variance

:<math>\frac{2(2n+5)}{9n (n-1)}</math>.<ref>{{SpringerEOM |title=Kendall coefficient of rank correlation |id=K/k055200 |first=A.V. |last=Prokhorov}}</ref>
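
As a minimal sketch of how this approximation is used in practice (the function name and the example values are illustrative), an observed τ can be divided by the square root of this variance to obtain an approximate standard normal score:

 import math
 def tau_z_score(tau, n):
     # variance of tau under the null hypothesis of independence
     var_tau = 2 * (2 * n + 5) / (9 * n * (n - 1))
     return tau / math.sqrt(var_tau)
 print(tau_z_score(0.35, 30))   # about 2.72, i.e. roughly 2.7 standard deviations from zero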
==Accounting for ties==
{{Refimprove|date=June 2010}}
A pair {(''x''<sub>''i''</sub>, ''y''<sub>''i''</sub>), (''x''<sub>''j''</sub>, ''y''<sub>''j''</sub>)} is said to be ''tied'' if ''x''<sub>''i''</sub> = ''x''<sub>''j''</sub> or ''y''<sub>''i''</sub> = ''y''<sub>''j''</sub>; a tied pair is neither concordant nor discordant. When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [−1, 1]:

===Tau-a===
The Tau-a statistic tests the [[strength of association]] of the [[cross tabulation]]s. Both variables have to be [[ordinal level|ordinal]]. Tau-a makes no adjustment for ties. It is defined as:

: <math>\tau_A = \frac{n_c-n_d}{n_0}</math>

where ''n''<sub>''c''</sub>, ''n''<sub>''d''</sub> and ''n''<sub>0</sub> are as defined in the next section.
===Tau-b===
The Tau-b statistic, unlike Tau-a, makes adjustments for ties.<ref>Agresti, A. (2010). ''Analysis of Ordinal Categorical Data'', Second Edition. New York: John Wiley & Sons.</ref> Values of Tau-b range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

The Kendall Tau-b coefficient is defined as:

: <math>\tau_B = \frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}}</math>

where

: <math>\begin{array}{ccl}
n_0 & = & n(n-1)/2\\
n_1 & = & \sum_i t_i (t_i-1)/2 \\
n_2 & = & \sum_j u_j (u_j-1)/2 \\
n_c & = & \mbox{Number of concordant pairs} \\
n_d & = & \mbox{Number of discordant pairs} \\
t_i & = & \mbox{Number of tied values in the } i^{th} \mbox{ group of ties for the first quantity} \\
u_j & = & \mbox{Number of tied values in the } j^{th} \mbox{ group of ties for the second quantity}
\end{array}
</math>
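
The quantities above can be tallied directly from the data. The following Python sketch (illustrative names, not a library implementation) computes Tau-a and Tau-b for data containing ties; its Tau-b value can be cross-checked against <code>scipy.stats.kendalltau</code>, which reports the Tau-b variant by default:

 from itertools import combinations
 from collections import Counter
 def tau_a_and_b(x, y):
     n = len(x)
     n0 = n * (n - 1) // 2
     nc = nd = 0
     for i, j in combinations(range(n), 2):
         s = (x[i] - x[j]) * (y[i] - y[j])
         if s > 0:
             nc += 1      # concordant pair
         elif s < 0:
             nd += 1      # discordant pair; tied pairs count for neither
     n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())   # pairs tied on the first quantity
     n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())   # pairs tied on the second quantity
     tau_a = (nc - nd) / n0
     tau_b = (nc - nd) / ((n0 - n1) * (n0 - n2)) ** 0.5
     return tau_a, tau_b
 print(tau_a_and_b([1, 1, 2, 3, 4], [1, 2, 2, 3, 5]))   # (0.8, 0.889): Tau-b removes tied pairs from the denominator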
===Tau-c===
Tau-c differs from Tau-b in being more suitable for rectangular tables than for square tables.
==Significance tests==
When two quantities are statistically independent, the distribution of <math>\tau</math> is not easily characterizable in terms of known distributions. However, for <math>\tau_A</math> the following statistic, <math>z_A</math>, is approximately distributed as a standard normal when the variables are statistically independent:

: <math>z_A = {3 (n_c - n_d) \over \sqrt{n(n-1)(2n+5)/2} }</math>

Thus, to test whether two variables are statistically dependent, one computes <math>z_A</math>, and finds the cumulative probability for a standard normal distribution at <math>-|z_A|</math>. For a 2-tailed test, multiply that number by two to obtain the ''p''-value. If the ''p''-value is below a given significance level, one rejects the null hypothesis (at that significance level) that the quantities are statistically independent.
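
A minimal Python sketch of this procedure (illustrative names; the two-sided ''p''-value is obtained from the standard normal distribution via <code>math.erfc</code>):

 import math
 def tau_a_test(nc, nd, n):
     # z_A statistic and two-sided p-value under the normal approximation
     z_a = 3 * (nc - nd) / math.sqrt(n * (n - 1) * (2 * n + 5) / 2)
     p = math.erfc(abs(z_a) / math.sqrt(2))   # equals 2 * Phi(-|z_A|)
     return z_a, p
 print(tau_a_test(31, 14, 10))   # z_A ~ 1.52, p ~ 0.13: not significant at the 5% level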
A number of adjustments to <math>z_A</math> are needed when accounting for ties. The following statistic, <math>z_B</math>, has the same distribution as <math>\tau_B</math>, and is again approximately distributed as a standard normal when the quantities are statistically independent:

:<math>z_B = {n_c - n_d \over \sqrt{ v } }</math>

where

:<math>\begin{array}{ccl}
v & = & (v_0 - v_t - v_u)/18 + v_1 + v_2 \\
v_0 & = & n (n-1) (2n+5) \\
v_t & = & \sum_i t_i (t_i-1) (2 t_i+5)\\
v_u & = & \sum_j u_j (u_j-1)(2 u_j+5) \\
v_1 & = & \sum_i t_i (t_i-1) \sum_j u_j (u_j-1) / (2n(n-1)) \\
v_2 & = & \sum_i t_i (t_i-1) (t_i-2) \sum_j u_j (u_j-1) (u_j-2) / (9 n (n-1) (n-2))
\end{array}
</math>
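
The tie-corrected variance and <math>z_B</math> can be assembled directly from the sizes of the tie groups, as in the following sketch (illustrative names; <code>nc</code> and <code>nd</code> would be counted as in the earlier sketches):

 from collections import Counter
 def z_b(nc, nd, x, y):
     # tie-corrected z statistic, assembled from the variance terms above
     n = len(x)
     t = Counter(x).values()   # sizes of the tie groups in the first quantity
     u = Counter(y).values()   # sizes of the tie groups in the second quantity
     v0 = n * (n - 1) * (2 * n + 5)
     vt = sum(ti * (ti - 1) * (2 * ti + 5) for ti in t)
     vu = sum(uj * (uj - 1) * (2 * uj + 5) for uj in u)
     v1 = sum(ti * (ti - 1) for ti in t) * sum(uj * (uj - 1) for uj in u) / (2 * n * (n - 1))
     v2 = (sum(ti * (ti - 1) * (ti - 2) for ti in t)
           * sum(uj * (uj - 1) * (uj - 2) for uj in u) / (9 * n * (n - 1) * (n - 2)))
     v = (v0 - vt - vu) / 18 + v1 + v2
     return (nc - nd) / v ** 0.5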
==Algorithms==
The direct computation of the numerator <math>n_c - n_d</math> involves two nested iterations, as characterized by the following pseudo-code:

 numer := 0
 '''for''' i:=2..N '''do'''
     '''for''' j:=1..(i-1) '''do'''
         numer := numer + sgn(x[i] - x[j]) * sgn(y[i] - y[j])
 '''return''' numer
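
A direct Python transcription of this pseudo-code (a sketch; the sign helper is written out because Python has no built-in <code>sgn</code>):

 def kendall_numerator(x, y):
     sgn = lambda a: (a > 0) - (a < 0)   # sign function: -1, 0 or +1
     numer = 0
     for i in range(1, len(x)):
         for j in range(i):
             numer += sgn(x[i] - x[j]) * sgn(y[i] - y[j])
     return numer   # equals n_c - n_d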
Although quick to implement, this algorithm is <math>O(n^2)</math> in complexity and becomes very slow on large samples. A more sophisticated algorithm<ref>{{cite journal
 |doi=10.2307/2282833
 |last=Knight |first=W.
 |year=1966
 |title=A Computer Method for Calculating Kendall's Tau with Ungrouped Data
 |journal=[[Journal of the American Statistical Association]]
 |volume=61 |issue=314 |pages=436–439
 |jstor=2282833}}</ref> built upon the [[Merge Sort]] algorithm can be used to compute the numerator in <math>O(n \cdot \log{n})</math> time.

Begin by ordering the data points by the first quantity, <math>x</math>, and secondarily (among ties in <math>x</math>) by the second quantity, <math>y</math>. With this initial ordering, <math>y</math> is not sorted, and the core of the algorithm consists of computing how many steps a [[Bubble Sort]] would take to sort this initial <math>y</math>. An enhanced [[Merge Sort]] algorithm, with <math>O(n \log n)</math> complexity, can be applied to compute the number of swaps, <math>S(y)</math>, that would be required by a [[Bubble Sort]] to sort <math>y</math>. Then the numerator for <math>\tau</math> is computed as:

:<math>n_c-n_d = n_0 - n_1 - n_2 + n_3 - 2 S(y)</math>,

where <math>n_3</math> is computed like <math>n_1</math> and <math>n_2</math>, but with respect to the joint ties in <math>x</math> and <math>y</math>.

A [[Merge Sort]] partitions the data to be sorted, <math>y</math>, into two roughly equal halves, <math>y_{left}</math> and <math>y_{right}</math>, then sorts each half recursively, and then merges the two sorted halves into a fully sorted vector. The number of [[Bubble Sort]] swaps is equal to:

:<math>S(y) = S(y_{left}) + S(y_{right}) + M(Y_{left},Y_{right})</math>

where <math>Y_{left}</math> and <math>Y_{right}</math> are the sorted versions of <math>y_{left}</math> and <math>y_{right}</math>, and <math>M(\cdot,\cdot)</math> characterizes the [[Bubble Sort]] swap-equivalent for a merge operation. <math>M(\cdot,\cdot)</math> is computed as depicted in the following pseudo-code:
 '''function''' M(L[1..n], R[1..m])
     i := 1
     j := 1
     nSwaps := 0
     '''while''' i <= n '''and''' j <= m '''do'''
         '''if''' R[j] < L[i] '''then'''
             nSwaps := nSwaps + n - i + 1
             j := j + 1
         '''else'''
             i := i + 1
     '''return''' nSwaps
A side effect of the above steps is that one obtains both a sorted version of <math>x</math> and a sorted version of <math>y</math>. With these, the factors <math>t_i</math> and <math>u_j</math> used to compute <math>\tau_B</math> are easily obtained in a single linear-time pass through the sorted arrays.
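
The swap-counting step can be sketched in Python as follows (illustrative names): the function returns a sorted copy of its input together with <math>S(y)</math>, counting, during each merge, how many remaining left-half elements each right-half element jumps over. Given <math>S(y)</math> and the tallies <math>n_0</math>, <math>n_1</math>, <math>n_2</math> and <math>n_3</math>, the numerator follows from the formula above.

 def count_swaps(y):
     # returns (sorted copy of y, number of Bubble Sort swaps S(y)), via merge sort
     if len(y) <= 1:
         return list(y), 0
     mid = len(y) // 2
     left, s_left = count_swaps(y[:mid])
     right, s_right = count_swaps(y[mid:])
     merged, swaps = [], s_left + s_right
     i = j = 0
     while i < len(left) and j < len(right):
         if right[j] < left[i]:
             swaps += len(left) - i   # right[j] must jump over every remaining left element
             merged.append(right[j])
             j += 1
         else:
             merged.append(left[i])
             i += 1
     merged += left[i:] + right[j:]
     return merged, swaps
 print(count_swaps([3, 4, 1, 2, 5]))   # ([1, 2, 3, 4, 5], 4)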

A second algorithm with <math>O(n \cdot \log{n})</math> time complexity, based on [[AVL tree]]s, was devised by David Christensen.<ref>{{cite journal
 |last=Christensen |first=David
 |year=2005
 |title=Fast algorithms for the calculation of Kendall's τ
 |journal=[[Computational Statistics]]
 |volume=20 |issue=1 |pages=51–62
 |doi=10.1007/BF02736122
}}</ref>
==See also==
{{Portal|Statistics}}
* [[Correlation]]
* [[Kendall tau distance]]
* [[Kendall's W]]
* [[Spearman's rank correlation coefficient]]
* [[Goodman and Kruskal's gamma]]
* [[Theil–Sen estimator]]
==References==
{{Reflist}}
* {{Cite book | last = Abdi |first=H. | contribution-url=http://www.utdallas.edu/~herve/Abdi-KendallCorrelation2007-pretty.pdf |contribution= Kendall rank correlation |editor-first=N.J. |editor-last=Salkind |title= Encyclopedia of Measurement and Statistics |location=Thousand Oaks (CA) |publisher= Sage| year = 2007 }}
* Kendall, M. (1948) ''Rank Correlation Methods'', Charles Griffin & Company Limited
==External links==
* [http://www.statsdirect.com/help/nonparametric_methods/kend.htm Tied rank calculation]
* [http://www.rsscse-edu.org.uk/tsj/bts/noether/text.html Why Kendall tau?]
* [http://law.dsi.unimi.it/software/ Software for computing Kendall's tau on very large datasets]
* [http://www.wessa.net/rwasp_kendall.wasp Online software: computes Kendall's tau rank correlation]
* [http://www.technion.ac.il/docs/sas/proc/zompmeth.htm The CORR Procedure: Statistical Computations]
{{Statistics|descriptive}}

{{DEFAULTSORT:Kendall Tau Rank Correlation Coefficient}}
[[Category:Covariance and correlation]]
[[Category:Non-parametric statistics]]
[[Category:Statistical tests]]
[[Category:Statistical dependence]]

[[de:Rangkorrelationskoeffizient#Kendalls Tau]]