en>Vadmium: Fix material implication link

2011-01-21T03:28:46Z

New page

In [[statistics]], the '''multiple comparisons''', '''multiplicity''' or '''multiple testing problem''' occurs when one considers a set of [[statistical inference]]s simultaneously<ref>{{cite book | last=Miller | first=R.G. | year=1981 | title=Simultaneous Statistical Inference 2nd Ed | publisher=Springer Verlag New York | isbn=0-387-90548-0}}</ref> or makes inferences about a subset of parameters selected based on the observed values.<ref>{{cite journal | journal=Biometrical Journal | title=Simultaneous and selective inference: Current successes and future challenges | year=2010 | volume=52 | last=Benjamini | first=Y. | pages=708–721 | doi=10.1002/bimj.200900299 | issue=6 | pmid=21154895}}</ref> Errors in inference, including [[confidence interval]]s that fail to include their corresponding population parameters or [[hypothesis test]]s that incorrectly reject the [[null hypothesis]], are more likely to occur when one considers the set as a whole. Several statistical techniques have been developed to counter this, allowing significance levels for single and multiple comparisons to be directly compared. These techniques generally require a stricter significance threshold for individual comparisons, so as to compensate for the number of inferences being made.

==History==

Interest in the problem of multiple comparisons began in the 1950s with the work of [[Tukey]] and [[Scheffé]]. Further ideas were developed in response to the needs of [[medical statistics]], and new methods and procedures followed: the [[closed testing procedure]] (Marcus et al., 1976) and the [[Holm–Bonferroni method]] (1979). Later, in the 1980s, the issue of multiple comparisons returned to prominence. Books were published: Hochberg and Tamhane (1987), Westfall and Young (1993), and Hsu (1996). In 1995 work on the [[false discovery rate]] and other new ideas began. In 1996 the first conference on multiple comparisons took place in [[Israel]]. It was followed by conferences around the world: [[Berlin]] (2000), [[Bethesda, Maryland|Bethesda]] (2002), [[Shanghai]] (2005), [[Vienna]] (2007), and [[Tokyo]] (2009). These developments reflect accelerating interest in multiple comparisons.<ref>{{cite journal | last1 = Benjamini | first1 = Y. | year = 2010 | title = Simultaneous and selective inference: Current successes and future challenges | url = | journal = Biom. J. | volume = 52 | issue = | pages = 708–721 | doi = 10.1002/bimj.200900299 | pmid=21154895}}</ref>

==The problem==

The term "comparisons" in multiple comparisons typically refers to comparisons of two groups, such as a treatment group and a control group. "Multiple comparisons" arise when a statistical analysis encompasses a number of formal comparisons, with the presumption that attention will focus on the strongest differences among all comparisons that are made. Failure to compensate for multiple comparisons can have important real-world consequences, as illustrated by the following examples.

* Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute ''by random chance alone.''

* Suppose we consider the efficacy of a [[Pharmacology|drug]] in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes more likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom.

* Suppose we consider the safety of a drug in terms of the occurrences of different types of side effects. As more types of side effects are considered, it becomes more likely that the new drug will appear to be less safe than existing drugs in terms of at least one side effect.

In all three examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison.

For example, if one test is performed at the 5% level, there is only a 5% chance of incorrectly rejecting the null hypothesis if the null hypothesis is true. However, for 100 tests where all null hypotheses are true, the expected number of incorrect rejections is 5. If the tests are independent, the probability of at least one incorrect rejection is 99.4%. These errors are called [[false positive]]s or [[Type I error]]s.
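The figures in this example can be checked empirically. The following is a minimal simulation sketch, assuming that a valid p-value under a true null hypothesis is uniformly distributed on [0, 1); the function name, the seed and the number of repetitions are illustrative choices, not part of any standard procedure.

```python
import random


def simulate_family(m, alpha, rng):
    """Return the number of false rejections among m true null hypotheses."""
    # Under a true null hypothesis, a valid p-value is uniform on [0, 1),
    # so each test is rejected with probability alpha.
    return sum(rng.random() < alpha for _ in range(m))


rng = random.Random(12345)          # fixed seed so the run is repeatable
reps, m, alpha = 2000, 100, 0.05

counts = [simulate_family(m, alpha, rng) for _ in range(reps)]
mean_false = sum(counts) / reps                   # estimates m * alpha = 5
share_any = sum(c > 0 for c in counts) / reps     # estimates 1 - (1 - alpha)**m

print(mean_false)
print(share_any)
```

With enough repetitions the two estimates should settle near 5 and 0.994 respectively, up to simulation noise.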

The problem also occurs for [[confidence intervals]]. A single confidence interval with a 95% [[coverage probability]] level will likely contain the population parameter it is meant to contain, i.e. in the long run 95% of confidence intervals built in that way will contain the true population parameter. However, if one considers 100 confidence intervals simultaneously, with coverage probability 0.95 each, it is highly likely that at least one interval will not contain its population parameter. The expected number of such non-covering intervals is 5, and if the intervals are independent, the probability that at least one interval does not contain the population parameter is 99.4%.

Techniques have been developed to control the false positive error rate associated with performing multiple statistical tests. Similarly, techniques have been developed to adjust confidence intervals so that the probability of at least one of the intervals not covering its target value is controlled.

===Classification of ''m'' hypothesis tests===

The following table classifies the possible outcomes when testing <math>m</math> null hypotheses, including the numbers of errors committed. It defines some random variables that are related to the <math>m</math> hypothesis tests.

{|class="wikitable"
! |
! Null hypothesis is True (H0)
! Alternative hypothesis is True (H1)
! | Total
|- align="center"
! | Declared significant
| <math>V</math>
| <math>S</math>
| <math>R</math>
|- align="center"
! | Declared non-significant
| <math>U</math>
| <math>T</math>
| <math>m - R</math>
|- align="center"
! Total
| <math>m_0</math>
| <math>m - m_0</math>
| <math>m</math>
|}

* <math>m</math> is the total number of hypotheses tested
* <math>m_0</math> is the number of true [[null hypothesis|null hypotheses]]
* <math>m - m_0</math> is the number of true [[alternative hypothesis|alternative hypotheses]]
* <math>V</math> is the number of [[Type I and type II errors|false positives (Type I error)]] (also called "false discoveries")
* <math>S</math> is the number of [[Type I and type II errors|true positives]] (also called "true discoveries")
* <math>T</math> is the number of [[Type I and type II errors|false negatives (Type II error)]]
* <math>U</math> is the number of [[Type I and type II errors|true negatives]]
* <math>R</math> is the number of rejected null hypotheses (also called "discoveries")
* In <math>m</math> hypothesis tests of which <math>m_0</math> are true null hypotheses, <math>R</math> is an observable random variable, and <math>S</math>, <math>T</math>, <math>U</math>, and <math>V</math> are unobservable [[random variable]]s.
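The counts in the table can be illustrated with simulated data, where the truth of each null hypothesis is known. This is only a sketch: the function name is hypothetical, and the p-values under the alternative are drawn from an arbitrary distribution concentrated near zero, chosen purely for illustration. In real data only <math>R</math> and <math>m</math> would be observable.

```python
import random


def classify_tests(p_values, null_is_true, alpha=0.05):
    """Count the outcomes V, S, U, T and R for a family of tests."""
    counts = {"V": 0, "S": 0, "U": 0, "T": 0}
    for p, is_null in zip(p_values, null_is_true):
        rejected = p <= alpha
        if rejected and is_null:
            counts["V"] += 1      # false positive (Type I error)
        elif rejected:
            counts["S"] += 1      # true positive
        elif is_null:
            counts["U"] += 1      # true negative
        else:
            counts["T"] += 1      # false negative (Type II error)
    counts["R"] = counts["V"] + counts["S"]   # all rejections
    return counts


rng = random.Random(0)
m, m0 = 1000, 900                               # illustrative sizes
null_is_true = [i < m0 for i in range(m)]
# Assumption: uniform p-values under the null, and an arbitrary
# distribution skewed towards zero under the alternative.
p_values = [rng.random() if is_null else rng.random() ** 4
            for is_null in null_is_true]

counts = classify_tests(p_values, null_is_true)
print(counts)
```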

==Example: Flipping coins==
Suppose one declares a coin biased if it lands heads at least 9 times in 10 flips. If one assumes as a [[null hypothesis]] that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is (10 + 1) × (1/2)<sup>10</sup> ≈ 0.0107. This is relatively unlikely, and under [[statistical significance|statistical criteria]] such as [[p-value]] < 0.05, one would reject the null hypothesis, i.e., declare the coin unfair.

A multiple-comparisons problem arises if one uses this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine testing 100 fair coins by this method. Since the probability of a fair coin coming up heads 9 or 10 times in 10 flips is 0.0107, seeing ''a particular'' (i.e., pre-selected) coin do so would still be very unlikely, but seeing ''any'' coin behave that way, without concern for which one, would be more likely than not. Precisely, the probability that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)<sup>100</sup> ≈ 0.34. Therefore, applying the single-test coin-fairness criterion to multiple comparisons would, more likely than not (with probability about 0.66), falsely identify at least one fair coin as unfair.
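The arithmetic of the coin example can be reproduced with exact binomial counts. The sketch below uses only the Python standard library; the function name is illustrative.

```python
from math import comb


def prob_at_least_k_heads(k: int, n: int) -> float:
    """Probability that a fair coin shows at least k heads in n flips."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


p_single = prob_at_least_k_heads(9, 10)   # (10 + 1) / 2**10
p_all_pass = (1 - p_single) ** 100        # all 100 fair coins judged fair

print(round(p_single, 4))         # 0.0107
print(round(p_all_pass, 2))       # 0.34
print(round(1 - p_all_pass, 2))   # 0.66
```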

==What can be done==
For hypothesis testing, the problem of multiple comparisons (also known as the ''multiple testing problem'') results from the increase in [[type I error]] that occurs when statistical tests are used repeatedly. If ''n'' independent comparisons are performed, the experiment-wide [[statistical significance|significance level]] <math> \bar{\alpha}</math>, also termed FWER for [[familywise error rate]], is given by
:<math> \bar{\alpha} = 1-\left( 1-\alpha_\mathrm{\{per\ comparison\}} \right)^n</math>.
Hence, unless the tests are perfectly dependent, <math>\bar{\alpha}</math> increases as the number of comparisons increases.
If we do not assume that the comparisons are independent, then we can still say:
:<math> \bar{\alpha} \le n \cdot \alpha_\mathrm{\{per\ comparison\}},</math>

which follows from [[Boole's inequality]]. Example: <math> 0.2649=1-\left( 1-.05 \right)^6 \le .05 \times 6=0.3</math>
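The two expressions can be compared numerically. The following sketch evaluates the familywise error rate under independence alongside the Boole bound, capped at 1 since it bounds a probability; the function names are illustrative.

```python
def fwer_independent(alpha: float, n: int) -> float:
    """Familywise error rate for n independent comparisons at level alpha."""
    return 1.0 - (1.0 - alpha) ** n


def boole_bound(alpha: float, n: int) -> float:
    """Upper bound n * alpha, valid under any dependence, capped at 1."""
    return min(1.0, n * alpha)


# The example from the text: six comparisons at the 0.05 level.
print(round(fwer_independent(0.05, 6), 4), round(boole_bound(0.05, 6), 4))  # 0.2649 0.3

# The bound holds for every n, but becomes loose as n grows.
for n in (1, 20, 100):
    print(n, fwer_independent(0.05, n), boole_bound(0.05, n))
```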

There are different ways to ensure that the familywise error rate is at most <math>\bar{\alpha}</math>. The most conservative method, which requires no assumptions about independence or the distribution of the test statistics, is the [[Bonferroni correction]] <math> \alpha_\mathrm{\{per\ comparison\}}=\bar{\alpha}/n</math>. A slightly less conservative correction can be obtained by solving the equation for the familywise error rate of <math>n</math> independent comparisons for <math>\alpha_\mathrm{\{per\ comparison\}}</math>. This yields <math>\alpha_\mathrm{\{per\ comparison\}}=1-{\left(1-\bar{\alpha}\right)}^{\frac{1}{n}}</math>, which is known as the [[Bonferroni correction#.C5.A0id.C3.A1k correction|Šidák correction]]. Another procedure is the [[Holm–Bonferroni method]], which is uniformly at least as powerful as the simple Bonferroni correction: only the smallest p-value (<math>i=1</math>) is tested against the strictest criterion, and the others (<math>i>1</math>) against progressively less strict criteria, stopping at the first p-value that fails its criterion.<ref>{{cite journal | last1 = Aickin | first1 = M | last2 = Gensler | first2 = H | year = 1996 | title = Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods | url = | journal = Am J Public Health | volume = 86 | issue = 5 | pages = 726–728 }}</ref> The criterion for the <math>i</math>-th smallest p-value is
<math> \alpha_\mathrm{\{per\ comparison\}}=\bar{\alpha}/(n-i+1)</math>.
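The three corrections can be sketched in a few lines. The function names and the example p-values below are hypothetical; the Holm function follows the usual step-down formulation, in which testing stops at the first p-value that fails its criterion.

```python
def bonferroni_level(fwer: float, n: int) -> float:
    """Per-comparison level under the Bonferroni correction."""
    return fwer / n


def sidak_level(fwer: float, n: int) -> float:
    """Per-comparison level under the Sidak correction (independent tests)."""
    return 1.0 - (1.0 - fwer) ** (1.0 / n)


def holm_reject(p_values, fwer=0.05):
    """Holm step-down procedure; returns rejections in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda j: p_values[j])
    reject = [False] * n
    for i, j in enumerate(order, start=1):
        # The i-th smallest p-value is compared with fwer / (n - i + 1).
        if p_values[j] <= fwer / (n - i + 1):
            reject[j] = True
        else:
            break                 # stop at the first failure
    return reject


p_values = [0.001, 0.014, 0.04, 0.3]        # hypothetical p-values
n = len(p_values)

bonferroni_reject = [p <= bonferroni_level(0.05, n) for p in p_values]
print(bonferroni_level(0.05, n), sidak_level(0.05, n))
print(bonferroni_reject)          # [True, False, False, False]
print(holm_reject(p_values))      # [True, True, False, False]
```

In this illustration the Holm procedure rejects one more hypothesis than the Bonferroni correction at the same familywise level.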

==Methods==
{{see also|Familywise error rate#Controlling procedures|False discovery rate#Controlling procedures}}
'''Multiple testing correction''' refers to re-calculating probabilities obtained from a statistical test which was repeated multiple times. In order to retain a prescribed familywise error rate α in an analysis involving more than one comparison, the error rate for each comparison must be more stringent than α. Boole's inequality implies that if each test is performed to have type I error rate α/''n'', the total error rate will not exceed α. This is called the [[Bonferroni correction]], and is one of the most commonly used approaches for multiple comparisons.

In some situations, the Bonferroni correction is substantially conservative, i.e., the actual familywise error rate is much less than the prescribed level α. This occurs when the test statistics are highly dependent (in the extreme case where the tests are perfectly dependent, the familywise error rate with no multiple comparisons adjustment and the per-test error rates are identical). For example, in fMRI analysis,<ref>{{cite doi|10.1016/j.neuroimage.2003.12.047}}</ref><ref>{{cite doi|10.1002/hbm.20471}}</ref> tests are done on over 100,000 [[voxel]]s in the brain. The Bonferroni method would require p-values to be smaller than .05/100000 to declare significance. Since adjacent voxels tend to be highly correlated, this threshold is generally too stringent.

Because simple techniques such as the Bonferroni method can be too conservative, there has been a great deal of attention paid to developing better techniques, such that the overall rate of false positives can be maintained without inflating the rate of false negatives unnecessarily. Such methods can be divided into general categories:
*Methods where total alpha can be proved to never exceed 0.05 (or some other chosen value) under any conditions. These methods provide "strong" control against Type I error, in all conditions including a partially correct null hypothesis.
*Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions.
*Methods which rely on an [[omnibus test]] before proceeding to multiple comparisons. Typically these methods require a significant [[ANOVA]]/[[Tukey's range test]] before proceeding to multiple comparisons. These methods have "weak" control of Type I error.
*Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data.

The advent of computerized [[resampling (statistics)|resampling]] methods, such as [[bootstrapping (statistics)|bootstrapping]] and [[Monte Carlo simulation]]s, has given rise to many techniques in the last of these categories. In some cases where exhaustive permutation resampling is performed, these tests provide exact, strong control of Type I error rates; in other cases, such as bootstrap sampling, they provide only approximate control.

== Post-hoc testing of ANOVAs ==
Multiple comparison procedures are commonly used in an analysis of variance after obtaining a significant [[omnibus test]] result, like the [[ANOVA]] [[F-test]]. The significant ANOVA result suggests rejecting the global null hypothesis H0 that the means are the same across the groups being compared. Multiple comparison procedures are then used to determine which means differ. In a one-way ANOVA involving ''K'' group means, there are ''K''(''K'' − 1)/2 pairwise comparisons.
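As a small illustration of the count of pairwise comparisons, the pairs can be enumerated directly; the group labels below are hypothetical.

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]             # hypothetical group labels, K = 4
pairs = list(combinations(groups, 2))     # every unordered pair of groups
K = len(groups)

print(pairs)
print(len(pairs), K * (K - 1) // 2)       # 6 6
```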

A number of methods have been proposed for this problem, some of which are:

;Single-step procedures
*[[Tukey's range test|Tukey–Kramer method]] (Tukey's HSD) (1951)
*[[Scheffe method]] (1953)

;Multi-step procedures based on [[Studentized range]] statistic
*[[Duncan's new multiple range test]] (1955)
* The [[Nemenyi test]] is similar to [[Tukey's range test]] in ANOVA.
* The [[Bonferroni–Dunn test]] allows comparisons, controlling the familywise error rate.{{Vague|date=February 2009}}
* [[Newman–Keuls method|Student Newman–Keuls]] [[post-hoc analysis]]
* [[Dunnett's test]] (1955) for comparing a number of treatments with a single control group.

Choosing the most appropriate multiple-comparison procedure for a specific situation is not easy. Many tests are available, and they differ in a number of ways.<ref>Howell (2002, Chapter 12: Multiple comparisons among treatment means)</ref>

For example, if the variances of the groups being compared are similar, the Tukey–Kramer method is generally viewed as performing optimally or near-optimally in a broad variety of circumstances.<ref>{{cite journal | journal=The American Statistician | title=The Status of Multiple Comparisons: Simultaneous Estimation of All Pairwise Comparisons in One-Way ANOVA Designs | year=1981 | volume=35 | last=Stoline | first=Michael R. | pages=134–141 | doi=10.2307/2683979 | issue=3 | publisher=American Statistical Association | jstor=2683979}}</ref> The situation where the variances of the groups being compared differ is more complex, and different methods perform well in different circumstances.

The [[Kruskal–Wallis test]] is the [[non-parametric]] alternative to ANOVA. Multiple comparisons can be done using pairwise comparisons (for example using [[Wilcoxon rank sum]] tests) and using a correction to determine if the post-hoc tests are significant (for example a [[Bonferroni correction]]).

== Large-scale multiple testing ==
Traditional methods for multiple comparisons adjustments focus on correcting for modest numbers of comparisons, often in an [[analysis of variance]]. A different set of techniques has been developed for "large-scale multiple testing", in which thousands or even greater numbers of tests are performed. For example, in [[genomics]], when using technologies such as [[DNA microarray|microarray]]s, expression levels of tens of thousands of genes can be measured, and genotypes for millions of genetic markers can be measured. Particularly in the field of [[genetic association]] studies, there has been a serious problem with non-replication: a result is strongly statistically significant in one study but fails to be replicated in a follow-up study. Such non-replication can have many causes, but it is widely considered that failure to fully account for the consequences of making multiple comparisons is one of the causes.

In different branches of science, multiple testing is handled in different ways. It has been argued that if statistical tests are only performed when there is a strong basis for expecting the result to be true, multiple comparisons adjustments are not necessary.<ref>{{cite journal | doi=10.1097/00001648-199001000-00010 | last=Rothman | first=Kenneth J. | journal=Epidemiology | volume=1 | pages=43–46 | year=1990 | title=No Adjustments Are Needed for Multiple Comparisons | issue=1 | publisher=Lippincott Williams & Wilkins | pmid=2081237 | jstor=20065622}}</ref> It has also been argued that use of multiple testing corrections is an inefficient way to perform [[empirical research]], since multiple testing adjustments control false positives at the potential expense of many more [[Type I and type II errors|false negatives]]. On the other hand, it has been argued that advances in [[measurement]] and [[information technology]] have made it far easier to generate large datasets for [[exploratory data analysis|exploratory analysis]], often leading to the testing of large numbers of hypotheses with no prior basis for expecting many of the hypotheses to be true. In this situation, very high [[false positive rate]]s are expected unless multiple comparisons adjustments are made.<ref>{{cite journal | last=Ioannidis | first=JPA | year=2005 | title=Why Most Published Research Findings Are False | journal=PLoS Med | volume=2 | doi=10.1371/journal.pmed.0020124 | pages=e124 | pmid=16060722 | issue=8 | pmc=1182327}}</ref>

For large-scale testing problems where the goal is to provide definitive results, the [[familywise error rate]] remains the most accepted parameter for ascribing significance levels to statistical tests. Alternatively, if a study is viewed as exploratory, or if significant results can be easily re-tested in an independent study, control of the [[false discovery rate]] (FDR)<ref>{{cite journal | last=Benjamini | first=Yoav | coauthors=Hochberg, Yosef | year=1995 | title=Controlling the false discovery rate: a practical and powerful approach to multiple testing | journal=[[Journal of the Royal Statistical Society, Series B]] | volume=57 | pages=125–133 | issue=1 | jstor=2346101}}</ref><ref>{{cite journal | last=Storey | first=JD | coauthors=Tibshirani, Robert | year=2003 | title=Statistical significance for genome-wide studies | journal=PNAS | volume=100 | pages=9440–9445 | doi=10.1073/pnas.1530509100 | pmid=12883005 | issue=16 | pmc=170937 | jstor=3144228}}</ref><ref>{{cite journal | last=Efron | first=Bradley | coauthors=Tibshirani, Robert; Storey, John D.; Tusher,Virginia | journal=[[Journal of the American Statistical Association]] | volume=96 | issue=456 | year=2001 | pages=1151–1160 | title=Empirical Bayes analysis of a microarray experiment | doi=10.1198/016214501753382129 | jstor=3085878}}</ref> is often preferred. The FDR, defined as the expected proportion of false positives among all significant tests, allows researchers to identify a set of "candidate positives", of which a high proportion are likely to be true. The false positives within the candidate set can then be identified in a follow-up study.
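One widely used way to control the FDR is the step-up procedure of Benjamini and Hochberg (1995), cited above. The following is a minimal sketch of that procedure, not a reference implementation: the function name and the p-values are hypothetical, and the classical guarantee assumes independent (or suitably positively dependent) tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k * q / m.
    k_max = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * q / m:
            k_max = rank
    # Reject the k_max smallest p-values, even if some of them individually
    # exceeded their own threshold (this is what makes it "step-up").
    reject = [False] * m
    for j in order[:k_max]:
        reject[j] = True
    return reject


# Hypothetical p-values from ten tests, in arbitrary order.
p_values = [0.041, 0.001, 0.3, 0.012, 0.008, 0.022, 0.6, 0.9, 0.019, 0.45]

bh = benjamini_hochberg(p_values, q=0.05)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
print(sum(bh), sum(bonferroni))   # 5 1
```

In this illustration the FDR-controlling rule yields five candidate positives, where the Bonferroni correction at the same nominal level yields one.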

===Assessing whether any alternative hypotheses are true===
[[Image:quantile meta test.svg|thumb|325px|A [[Q-Q plot|normal quantile plot]] for a simulated set of test statistics that have been standardized to be [[standard score|Z-scores]] under the null hypothesis. The departure of the upper tail of the distribution from the expected trend along the diagonal is due to the presence of substantially more large test statistic values than would be expected if all null hypotheses were true. The red point corresponds to the fourth largest observed test statistic, which is 3.13, versus an expected value of 2.06. The blue point corresponds to the fifth smallest test statistic, which is -1.75, versus an expected value of -1.96. The graph suggests that it is unlikely that all the null hypotheses are true, and that most or all instances of a true alternative hypothesis result from deviations in the positive direction.]]

A basic question faced at the outset of analyzing a large set of testing results is whether there is evidence that any of the alternative hypotheses are true. One simple meta-test that can be applied when it is assumed that the tests are independent of each other is to use the [[Poisson distribution]] as a model for the number of significant results at a given level α that would be found when all null hypotheses are true. If the observed number of positives is substantially greater than what should be expected, this suggests that there are likely to be some true positives among the significant results. For example, if 1000 independent tests are performed, each at level α = 0.05, we expect 50 significant tests to occur when all null hypotheses are true. Based on the Poisson distribution with mean 50, the probability of observing more than 62 significant tests is less than 0.05, so if we observe more than 62 significant results, this is evidence that some of them correspond to situations where the alternative hypothesis holds. A drawback of this approach is that it over-states the evidence that some of the alternative hypotheses are true when the [[test statistic]]s are positively correlated, which commonly occurs in practice. {{citation needed|date=August 2012}}
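The Poisson tail probability used by this meta-test can be evaluated exactly with a short summation. The sketch below assumes independent tests, as the meta-test itself does; the function names are illustrative.

```python
from math import exp


def poisson_upper_tail(k: int, mean: float) -> float:
    """P(X > k) for X ~ Poisson(mean), computed as 1 - P(X <= k)."""
    term = exp(-mean)            # P(X = 0)
    cdf = term
    for j in range(1, k + 1):
        term *= mean / j         # P(X = j) obtained from P(X = j - 1)
        cdf += term
    return 1.0 - cdf


def smallest_cutoff(mean: float, level: float = 0.05) -> int:
    """Smallest count k such that P(X > k) < level."""
    k = 0
    while poisson_upper_tail(k, mean) >= level:
        k += 1
    return k


mean = 1000 * 0.05               # expected number of significant tests
cutoff = smallest_cutoff(mean)
print(cutoff, poisson_upper_tail(cutoff, mean))
```

Observing more significant results than the printed cutoff would then count as evidence, at the chosen level, against the hypothesis that all null hypotheses are true.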

Another common approach that can be used in situations where the [[test statistics]] can be standardized to [[standard score|Z-scores]] is to make a [[Q-Q plot|normal quantile plot]] of the test statistics. If the observed quantiles are markedly more [[statistical dispersion|dispersed]] than the normal quantiles, this suggests that some of the significant results may be true positives.{{citation needed|date=January 2012}}
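The points of such a plot can be computed without plotting software. The sketch below pairs each sorted Z-score with the standard normal quantile at plotting position (''i'' − 0.5)/''m''; that plotting position is one common convention and is an assumption here, as are the function name and the simulated data.

```python
import random
from statistics import NormalDist


def normal_qq_points(z_scores):
    """Pair expected standard normal quantiles with the sorted Z-scores."""
    m = len(z_scores)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / m lie strictly between 0 and 1.
    expected = [std_normal.inv_cdf((i - 0.5) / m) for i in range(1, m + 1)]
    return list(zip(expected, sorted(z_scores)))


rng = random.Random(1)
# Simulated data: 190 true nulls, plus 10 alternatives shifted upwards by 3.
z_scores = ([rng.gauss(0, 1) for _ in range(190)]
            + [rng.gauss(3, 1) for _ in range(10)])

points = normal_qq_points(z_scores)
for expected_q, observed_q in points[-3:]:
    # In the upper tail the observed values should exceed the expected ones.
    print(round(expected_q, 2), round(observed_q, 2))
```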

==See also==
;Key concepts
*[[Familywise error rate]]
*[[False positive rate]]
*[[False discovery rate]] (FDR)
*[[False coverage rate]] (FCR)
*[[Interval estimation]]
*[[Post-hoc analysis]]
*[[Experimentwise error rate]]

;General methods of alpha adjustment for multiple comparisons
*[[Closed testing procedure]]
*[[Bonferroni correction]]
*Boole–[[Bonferroni bound]]
*[[Holm–Bonferroni method]]
*[[Testing hypotheses suggested by the data]]

==References==
{{Reflist|2}}

==Further reading==
* F. Bretz, T. Hothorn, P. Westfall (2010), Multiple Comparisons Using R, CRC Press
* S. Dudoit and M. J. van der Laan (2008), Multiple Testing Procedures with Application to Genomics, Springer
* B. Phipson and G. K. Smyth (2010), Permutation P-values Should Never Be Zero: Calculating Exact P-values when Permutations are Randomly Drawn, Statistical Applications in Genetics and Molecular Biology Vol. 9 Iss. 1, Article 39, {{doi|10.2202/1544-6155.1585}}
* P. H. Westfall and S. S. Young (1993), Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley
{{Experimental design}}
{{Statistics}}

[[Category:Hypothesis testing]]
[[Category:Multiple comparisons| ]]
