Original articles
Should we always choose a nonparametric test when comparing two apparently nonnormal distributions?

https://doi.org/10.1016/S0895-4356(00)00264-X

Abstract

When clinical data are subjected to statistical analysis, a common question is how to choose an appropriate significance test. When two independent groups are compared and the observations are measured on a continuous scale, the question is typically whether to choose the two-sample t test or the Wilcoxon–Mann–Whitney test (WMW test). Similar results are often obtained, but which conclusion can be drawn if the significance tests give highly different P-values? The t test is optimal for normally distributed observations with common variance and is robust to deviations from normality if sample sizes are not very small. The WMW test makes no distributional assumptions, but depends heavily on equal shape and variance of the two distributions (homoscedasticity). We have compared the properties of the traditional two-sample t test, a modified t test allowing unequal variance, and the WMW test by stochastic simulation. All show acceptable behaviour when the two distributions have similar variance. When variances differ, the modified t test is superior to the other two.

Introduction

An appropriate statistical analysis of clinical data demands choosing an adequate statistical method. One decision that has to be made is whether to employ a parametric or a nonparametric method. The methods are based on different assumptions, and if these are violated, the statistical analysis may lead to erroneous conclusions (i.e., misleading P-values). The parametric methods make assumptions regarding the shape of the distribution of observations, whereas one is often led to believe that nonparametric methods make no such assumptions. The latter are therefore often referred to as distribution free.

If the assumptions made in the model formulation are incompatible with the data being analysed, the true significance level may be far from the nominal level that was specified. The common choice of nominal level is 5%. Obviously, we do not want to use a method with a higher probability of a false-positive result than planned. It may at first sight seem less obvious that a reduced level is a disadvantage, but a conservative test will usually lead to loss of power and the risk of drawing a false-negative conclusion will thus increase. In order to maintain the desired significance level and power of a test, it is therefore important to choose a test that applies to the problem at hand.

Clinical studies often compare the means of two independent groups of patients. If the observations are measured on a continuous scale, the statistical analysis is usually performed by the two-sample t test or the Wilcoxon–Mann–Whitney (WMW) test. There is no uniform agreement as to the choice between them, but the traditional recommendation seems to be a t test if the observations seem to be reasonably normally distributed, or if the number of patients in each sample is large, whereas the WMW test is recommended if sample sizes are small and the distributions seem to be skew.

Unfortunately, the choice between methods is not necessarily straightforward. Simple two-sample tests, whether parametric or not, make one important assumption: the two distributions that are compared are assumed to have the same shape and variance. This is referred to as a pure shift model or homoscedasticity. If the variances of the two distributions are not equal or if the distributions have different shapes, this assumption is violated. Tests allowing unequal variance of two normal distributions were developed more than 60 years ago [1]. It has previously been shown that the WMW test and the t test can have true significance levels that differ substantially from the nominal levels when the two population variances are not equal [2], [3], [4], [5], [6], [7]. This work has mostly been published in statistical journals. Some interest in the subject has been demonstrated in the behavioural sciences [8], and recommendations regarding the choice between tests based on power under certain distributional assumptions have been published (e.g., in Ref. [9]). The problem of heteroscedasticity (unequal shape or variance) seems generally not to be acknowledged in the medical literature and has received little attention in textbooks of applied statistics.

The aim of this study is to compare the properties of commonly used statistical tests. We have restricted the comparison to tests implemented in standard software: the two different versions of the t test (assuming equal and unequal variances, respectively) and the WMW test. Their properties are demonstrated in several situations that are typically met when clinical data are analysed. Various combinations of sample sizes, as well as shapes and variance of distributions are examined. The test properties are compared by stochastic simulation. A guide to choosing an appropriate test is also given.

Section snippets

A clinical example

After treatment with high-dose chemotherapy (HDT), a total of 35 patients with malignant lymphoma received peripheral blood progenitor cells (PBPC) mobilised with MIME/G-CSF [10]. Ten patients had Hodgkin's disease and 25 had non-Hodgkin's lymphoma. Time to neutrophil recovery was defined as the time from reinfusion of stem cells until the number of neutrophils exceeded 0.5 × 10⁹/L. Fig. 1 shows the time to neutrophil recovery in each diagnosis group.

Table 1 shows the results of different

Definition of the tests

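To test the hypothesis of equality of the means of two distributions, the two-sample t test is applied. Let $x_i$ denote the observations in Group A and $y_j$ the observations in Group B. The number of observations in each group is $m$ and $n$, respectively. Then

$$\bar{x} = \frac{\sum_{i=1}^{m} x_i}{m}$$

and

$$\bar{y} = \frac{\sum_{j=1}^{n} y_j}{n}$$

are the estimated means of the two distributions, and

$$s_x^2 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m-1}$$

and

$$s_y^2 = \frac{\sum_{j=1}^{n} (y_j - \bar{y})^2}{n-1}$$

the estimated variances.

The t test is based on the statistic

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2 (m-1) + s_y^2 (n-1)}{m+n-2}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$

and is known to be the best test if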
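The pooled two-sample t statistic defined in this section can be computed directly from the group means and variances. The following is a minimal Python sketch using only the standard library; the function name `pooled_t` and the example data are hypothetical and serve only to illustrate the formula, with m + n − 2 degrees of freedom as in the equal-variance t test.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Return (t, df) for the equal-variance two-sample t test."""
    m, n = len(x), len(y)
    # Pooled variance: the two sample variances weighted by m - 1 and n - 1.
    pooled_var = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / m + 1 / n))
    return t, m + n - 2


# Hypothetical observations, for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = pooled_t(group_a, group_b)
```

Note that `statistics.variance` divides by the number of observations minus one, which matches the variance estimates used in the statistic.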

Different models

Fig. 2A shows a pure shift model of two normal distributions. The distributions have the same variance; only the means differ. For this model, the two-sample t test is known to be the best test. Fig. 2B shows a pure shift model where the distributions are skew with a heavy right tail (gamma distributions with shape parameter a = 3). For situations similar to the one illustrated in Fig. 2B, the WMW test is recommended if sample sizes are small.
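To make the notion of a pure shift model concrete, the sketch below draws two samples from gamma distributions with shape parameter 3 that differ only by a constant shift. The scale parameter (set to 1), the shift, the sample sizes, and the function name `pure_shift_samples` are assumptions chosen for illustration; they are not taken from the article.

```python
import random
from statistics import mean, variance


def pure_shift_samples(m, n, shift, shape=3.0, scale=1.0, seed=0):
    """Draw two samples whose distributions differ only in location."""
    rng = random.Random(seed)
    x = [rng.gammavariate(shape, scale) for _ in range(m)]
    # Same shape and variance as x; only the location is moved by `shift`.
    y = [rng.gammavariate(shape, scale) + shift for _ in range(n)]
    return x, y


x, y = pure_shift_samples(5000, 5000, shift=2.0)
print(round(mean(x), 2), round(mean(y), 2))          # means differ by about the shift
print(round(variance(x), 2), round(variance(y), 2))  # variances are about equal
```

Under these assumptions both samples have a theoretical variance of shape × scale² = 3, and the theoretical means are 3 and 3 + shift.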

Fig. 2C shows a situation with two normal distributions that do

Simulations

The properties of the three tests have been examined by stochastic simulation. The simulation programs were written in SIMULA [11] and executed on a SUN computer at the University of Oslo.
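A simulation of this kind can be sketched in a few lines of Python. The sketch below is not the authors' SIMULA program. It assumes normal distributions with a common mean, takes the unequal-variance t test to be the familiar statistic with Welch–Satterthwaite degrees of freedom, computes t-distribution P-values by numerical (Simpson) integration, and uses the normal approximation without continuity correction for the WMW test. All function names, the number of replications, and the seed are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_two_sided_p(t, df, steps=200):
    """Two-sided P-value for Student's t via Simpson integration of the density."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    pdf = lambda u: math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def pooled_t_p(x, y):
    """Equal-variance two-sample t test."""
    m, n = len(x), len(y)
    sp2 = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / m + 1 / n))
    return t_two_sided_p(t, m + n - 2)


def welch_p(x, y):
    """Unequal-variance t test (assumed Welch-Satterthwaite degrees of freedom)."""
    m, n = len(x), len(y)
    a, b = variance(x) / m, variance(y) / n
    t = (mean(x) - mean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a * a / (m - 1) + b * b / (n - 1))
    return t_two_sided_p(t, df)


def wmw_p(x, y):
    """WMW test using the normal approximation (no continuity or tie correction)."""
    m, n = len(x), len(y)
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    z = (u - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2))


def estimated_levels(m, n, sd_ratio, reps=2000, alpha=0.05, seed=1):
    """Estimate each test's rejection rate when the two means are equal."""
    rng = random.Random(seed)
    tests = {"t": pooled_t_p, "welch": welch_p, "wmw": wmw_p}
    rejections = dict.fromkeys(tests, 0)
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(m)]
        y = [rng.gauss(0.0, sd_ratio) for _ in range(n)]
        for name, test in tests.items():
            rejections[name] += test(x, y) < alpha
    return {name: count / reps for name, count in rejections.items()}


if __name__ == "__main__":
    for ratio in (1 / 3, 1.0, 3.0):
        print(ratio, estimated_levels(10, 10, ratio))
```

Because both samples are drawn with the same mean, every rejection is a false positive, so the returned proportions estimate the true significance level of each test for the chosen sample sizes and S.D. ratio. With only 2000 replications the estimates carry a standard error of roughly 0.005 near the 5% level; more replications would be needed for finer comparisons.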

In the simulation program, independent samples were drawn from two distributions with the same shape and mean, but possibly different variance. The ratio between the variances was varied between 1/9 and 9. In terms of S.D., this corresponds to ratios between 1/3 and 3. The parameter of importance is this ratio; the actual

Sample sizes equal, m=n=10

Fig. 3 shows estimated significance levels for the three tests when the observations are sampled from normal distributions. For graphical purposes, the S.D. ratio is presented rather than the ratio between variances. All three tests obtain the nominal (desired) level when variances are equal. Welch's U test, developed for situations with unequal variances, maintains the nominal level (0.05) throughout. The t test and the WMW test, however, have somewhat higher significance levels than desired

Choosing an appropriate test

A guide to choosing an appropriate test is given in Table 2. The effect of differences in variances is much more striking than the sensitivity to different types of distribution.

Based on the numerical studies above, recommendations can be given as to which P-value to report in the comparison of time to neutrophil recovery in Hodgkin's and non-Hodgkin's lymphoma. It has been demonstrated that the Welch U test is a better test than the other two when both shapes and variances differ. The

Discussion

Two slightly different versions of significance tests that allow unequal variances were proposed by Welch [1]. It has previously been shown that the properties of the so-called Welch's V test are marginally better than those of the U test [2], [12]. Nevertheless, we have chosen to present the U test here, as the U test is implemented in most standard statistical software and therefore used in practice.

Simulations have been performed with a number of different distributions. In addition to the

