
Essentials of Biostatistics

Indian Pediatrics 2000;37: 1210-1227.

10. Statistical Inference from Quantitative Data : Comparison of Means and Other Locations


A. Indrayan
L. Satyanarayana*

From the Division of Biostatistics and Medical Informatics, University College of Medical Sciences, Dilshad Garden, Delhi 110 095, India and *Institute of Cytology and Preventive Oncology, Maulana Azad Medical College Campus, Bahadur Shah Zafar Marg, New Delhi 110 002, India.

Correspondence to: Dr. A. Indrayan, Professor of Biostatistics, Division of Biostatistics and Medical Informatics, University College of Medical Sciences, Dilshad Garden, Delhi 110 095, India

E-Mail: [email protected]

The focus in this Article is on quantitative measurements that are generally summarized in terms of means. Hemoglobin level (Hb), serum zinc level, blood pressure, anthropometry such as weight and head circumference, etc., are examples of such measurements. We discussed in our earlier Articles(1,2) that mean is a statistic that depends on the pattern of distribution of values in the target population. Thus, forms such as Gaussian are especially important for drawing inference in this case. Section 10.1 is on comparison of means. This could be with a specified value, such as the mean Hb level of children given iron supplementation compared with a prespecified desirable value of 14.0 g/dl, or could be the comparison of means of two independent groups, such as comparison of mean birth weight of infants born to women with and without vitamin A supplementation. Comparison of means in paired data is given in Section 10.2. Comparing the mean hematocrit levels of a group of children with dengue fever before and after therapy is an example of paired comparison. This section also includes cross-over trials, such as measurement of forced expiratory volume (FEV) in asthmatic children receiving two types of treatments in a cross-over (or interchange) manner. The methods of these two sections are generally applicable when the underlying distribution is Gaussian. For non-Gaussian forms, particularly when n is small, the methods needed to compare locations are called nonparametric methods. The distribution pattern in some restricted classes of subjects, such as blood glucose level in diabetics and triglyceride level in the obese, can be non-Gaussian. Nonparametric methods for comparing two independent and paired samples are discussed in Section 10.3. Section 10.4 is on comparison of means in three or more groups, such as mean heart rate of children belonging to families with different dietary patterns. This section describes the popular ANOVA procedure.
The last Section is a discussion of the real meaning of statistical significance so that it is not confused with medical significance.

 10.1 Comparison of Means - Independent Groups

Two different problems are included in this section. The first is comparing the mean in a sample of subjects with a prespecified mean. The second is comparing the mean in one sample with that in another sample. These samples would most likely represent two groups such as with and without disease, with mild and with serious forms of disease, patients on drug-1 and on drug-2, male and female, etc. These setups will become clear when we discuss them in the following paragraphs.

 Comparison with a Prespecified Mean

Let the interest be in finding whether children with chronic diarrhea have the same average Hb level as normally seen in healthy subjects in that area. Suppose the normal level of Hb is 14.6 g/dl. This is assumed to be known and fixed for the present example. Suppose a random sample of 10 children with chronic diarrhea is investigated and the average Hb level is found to be 13.8 g/dl. Can it be concluded with reasonable confidence on the basis of this sample that patients with chronic diarrhea have lower Hb on average? How can we be confident that another sample will not give a value as high as in healthy children? After all, this is a random sample and another sample can always give a different picture. There is only one sample in this example and the comparison is with the known average in healthy subjects. It is a one-sample problem though the comparison is of two means – one found in the sample and the other known for the healthy population.

The null hypothesis(3) in the above example is H0 : μ = 14.6 g/dl. This is also stated as: μ0 = 14.6 g/dl. The possibility of a higher Hb level in children with chronic diarrhea is ruled out. Thus, the alternative hypothesis is one-sided. That is, H1 : μ < 14.6 g/dl. If H0 is rejected then H1 is considered true. The difference is 13.8 – 14.6 = –0.8 g/dl. This magnitude is assessed relative to the expected variation from sample to sample. As explained in an earlier Article(3), this is measured by the standard error (SE) of the mean. The ratio of the observed difference to its estimated SE leads to Student's t-test when the underlying distribution is Gaussian. We can use this for testing hypotheses relating to the means of one or two samples. In case of one sample,

                           (Sample mean – μ0)
Student's tn – 1 = ––––––––––––––––––      [1]
 (One-Sample)               SE (mean)

where SE (mean) = SD/sqrt(n) and SD is estimated from the sample; μ0 is the hypothesized mean – the value of the mean under H0. The value of t is referred at (n – 1) df to the standard Student's t tables available in statistics textbooks to check whether the probability of Type I error (P) is less than a threshold such as 0.10, 0.05 or 0.01 that we earlier(3) called the significance level. This procedure is the same as already explained in our earlier Articles(3). These days computers are commonly available and a statistical package will give the P-value(3) straight away. You only need to check that it is less than the predetermined significance level. When a computer package is available, there is no need to consult t-tables in statistics books.

Example 1: We pursue our example of Hb level in children with chronic diarrhea. In a random sample of size n = 10, suppose the levels in g/dl are 11.5, 12.2, 14.9, 14.0, 15.4, 13.8, 15.0, 11.2, 16.1 and 13.9. These give mean and SD as follows:

x̄ = 13.8 g/dl and SD = 1.672 g/dl

The null hypothesis under test is that the average Hb level in children with chronic diarrhea is the same 14.6 g/dl as is normal in healthy children. Thus, μ0 = 14.6 g/dl. The alternative is H1 : μ < 14.6 g/dl since it cannot be higher in children with chronic diarrhea. Under H0,

        13.8 – 14.6
t9 = ––––––––––––– = – 1.51
       1.672/sqrt(10)

A statistical package gives P (t < –1.51) = 0.0827. If the package is not available, the standard table of Student's t gives P > 0.05 for 9 df when t is 1.51. (Student's t is symmetric and thus the P-value on the negative side is the same as on the positive side.) This is the probability of getting a sample mean this much or more extreme in favor of H1 when the mean actually is 14.6 g/dl. Since H1 is one-sided in this case, the probability required too is one-tailed. Thus, the chance of observing such a sample when H0 is true is more than 5 per cent. Therefore, this H0 cannot be rejected. Conclude that this sample of children with chronic diarrhea does not support the contention that their Hb level is lower on average than the 14.6 g/dl of healthy children.
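The computation of Example 1 can be verified in a few lines of Python; a minimal sketch using only the standard library (the data are the ten values given above):

```python
import math
from statistics import mean, stdev

# Hb levels (g/dl) of the 10 children with chronic diarrhea (Example 1)
hb = [11.5, 12.2, 14.9, 14.0, 15.4, 13.8, 15.0, 11.2, 16.1, 13.9]
mu0 = 14.6                       # hypothesized mean under H0

n = len(hb)
se = stdev(hb) / math.sqrt(n)    # SE(mean) = SD/sqrt(n)
t = (mean(hb) - mu0) / se        # one-sample Student's t, equation [1]

print(round(mean(hb), 1), round(stdev(hb), 3), round(t, 2))  # → 13.8 1.672 -1.51
```

The one-tailed P-value would then be read from a t-table at n – 1 = 9 df, or obtained from a package.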

 Means of Two Independent Groups

Consider a situation where means of a variable, for example, serum zinc level, are available from two groups. These two groups could be well-nourished and malnourished children, low birth weight and normal birth weight children, children suffering from acute diarrhea and persistent diarrhea, etc. Such groups are called independent groups because subjects included in both the groups are neither same nor overlapping. For this reason, the levels in one group do not affect the levels in the other group. This type of situation was discussed in the previous Article(4) for proportions. Interest now is on means. For the procedures discussed in this and other sections, it is necessary that the samples are randomly selected.

For the comparison of means by Student's t-test, the first step is to check that the variances (SD2) in the two groups are not widely different. Generally, a ratio SD12/SD22 < 2 is considered adequate if each n is around 10 or 15. If n is very small, even a ratio of three can be tolerated. If n is 20 or more then a ratio of 2 may be too high. The conventional statistical test for H0 : σ12 = σ22 is F = SD12/SD22. However, it is necessary for this test (as well as for Student's t-test) that the underlying distribution is Gaussian, at least approximately. To judge approximate Gaussianity of a set of data, check the bell shape of the frequency distribution by plotting a histogram. Departure from Gaussianity can also be suspected if the mean and variance of a set of data are nearly equal. A test for Gaussianity with a graphical check (probability plot) is also available in several statistical computer programs.
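This preliminary check is a one-liner; a minimal sketch (the SD values here are hypothetical, chosen for two groups of moderate size):

```python
# Hypothetical SDs of the two groups to be compared
sd1, sd2 = 0.9, 1.1

# Ratio of larger to smaller variance; values well under 2 support pooling
f_ratio = max(sd1**2, sd2**2) / min(sd1**2, sd2**2)
print(round(f_ratio, 2))  # → 1.49
```

A ratio of about 1.5 with n around 10 or 15 per group would, by the rule of thumb above, justify the pooled-variance t-test.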

The procedure for performing Student's t-test is different for equal variances than for unequal variances of the groups. The procedure for equal variances is given below. The null hypothesis in the comparison of means is H0 : μ1 = μ2. This states that the means in the two groups from which the samples are drawn are equal. We want to test whether or not the sample observations in the two groups provide sufficient evidence to decide against this initial assumption.

Test of means when population variances are equal: In this case the two variances can be pooled to obtain a more reliable estimate of the variance. This is given by

Pooled variance:

         (n1 – 1)SD12 + (n2 – 1)SD22
sp2 = ––––––––––––––––––––––––
                  (n1 – 1) + (n2 – 1)

where SD1 and SD2 are standard deviations of the first sample and second sample respectively, and n1 is the size of the first sample and n2 of the second sample. Now, calculate

SE (mean difference) = sp sqrt(1/n1 + 1/n2) .

Then the test criterion is

                                         x̄1 – x̄2
Student's t(n1+ n2 – 2) = ––––––––––––––––––
for two                          SE (mean difference)
independent
samples                                                   [2]

 

where x̄1 and x̄2 are the respective sample means. The degrees of freedom for the t-test shown in expression [2] are n1 + n2 – 2. This is called the pooled-variance Student's t-test. Note that in this case also, just as in the one-sample problem, Student's t assesses the magnitude of the difference against its standard error (SE). If the difference is large relative to the SE, then we are more confident that the difference is not due to sampling fluctuation and is real.

Example 2: Suppose the sample sizes, means and SDs of Hb levels in random samples from well-nourished and under-nourished groups are as follows:

Well-nourished (group-1)

n1 = 100, x̄1 = 10.1 and SD1 = 0.9

Under-nourished (group-2)

n2 = 70, x̄2 = 9.7 and SD2 = 1.1

The SDs do not differ too much and we can pool them. Thus,

sp2 = (99 x 0.92 + 69 x 1.12) / (100 + 70 – 2)

= 0.97.

In this case df = (n1 + n2 – 2) = (100 + 70 – 2) = 168, and the alternative hypothesis is that the average level in the well-nourished group is higher than in the under-nourished group. Now,

                10.1 – 9.7
t168 = –––––––––––––––––– = 2.61 .
           sqrt( 0.97 (1/100 + 1/70))

From the standard Student's t-table, P < 0.05 since 2.61 is greater than the critical value 1.65 at 168 df in the table at one-tail 5 per cent probability. This means that, if the means in the two groups were equal, a difference this large would arise with less than 5 per cent chance. Thus the null hypothesis of equality of means can be safely rejected. There is enough evidence in these samples to conclude that the mean hemoglobin level of the well-nourished group is higher than that of the under-nourished. Note that the alternative hypothesis in this case is one-sided, saying that the Hb level is "higher" in the well-nourished group. When the null hypothesis of equality is rejected, this alternative is accepted.
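Example 2 can be reproduced from the summary statistics alone; a minimal sketch in Python (carrying full precision, so t comes to 2.60 rather than the 2.61 obtained when sp2 is rounded to 0.97):

```python
import math

# Summary statistics from Example 2
n1, m1, sd1 = 100, 10.1, 0.9   # well-nourished
n2, m2, sd2 = 70, 9.7, 1.1     # under-nourished

# Pooled variance and SE of the mean difference
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1/n1 + 1/n2))
t = (m1 - m2) / se             # pooled-variance Student's t, equation [2]

print(round(sp2, 2), round(t, 2))  # → 0.97 2.6
```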

Test for means when population variances are unequal: If the population variances are known to be unequal or if the sample variances are very different from one another then the pooled variance should not be obtained. In this case, the separate-variance Student's t-test for comparison of means should be used. This involves a lengthy formula and we do not want to burden you with mathematical details. The setup is known as the Behrens-Fisher problem. Interested readers may consult Snedecor and Cochran(5). A statistical package again will give you the P-value that you can use to draw the correct inference. A package such as SPSS gives results for equal and unequal variances as well as the result of a test of equality of variances. It is for the user to choose the right result. When the sample sizes are really large (say, more than 100) the procedure for equal variances explained in Example 2 can be used for unequal variances also.
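For readers who want the separate-variance test without a package, the usual Welch-Satterthwaite formulas (one common solution to the Behrens-Fisher problem) are short enough to code directly; a sketch, applied for illustration to the summary statistics of Example 2:

```python
import math

# Summary statistics (same numbers as Example 2, used only for illustration)
n1, m1, sd1 = 100, 10.1, 0.9
n2, m2, sd2 = 70, 9.7, 1.1

v1, v2 = sd1**2 / n1, sd2**2 / n2
t = (m1 - m2) / math.sqrt(v1 + v2)    # separate-variance t

# Satterthwaite's approximate degrees of freedom
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(round(t, 2), round(df))  # → 2.51 129
```

The P-value is then read from the t-distribution at the (fractional) df; packages such as SPSS report this as the "equal variances not assumed" row.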

 10.2 Comparison of Means - Paired Setup

Consider a setting where observations on a quantitative variable are available in a group of patients at two different times. These could be lung functions in asthmatic children before and after therapy, or the values of T4, TSH, diastolic BP, systolic BP or heart rate in hypothyroid children before and after treatment.

In case of paired samples, obtain the differences between the pairs as di = (x2i – x1i), i=1,2,...,n, where n is the number of pairs. Then calculate SD of these differences as usual by

SDd = sqrt( Σ(di – d̄)2 / (n – 1) ) .

The null hypothesis for rejection in this case is H0 : μ1 = μ2 where μ1 is the mean before and μ2 is the mean after. As already stated, these could be mean diastolic BP in hypothyroid children before and after treatment, mean forced expiratory volume in asthmatic children, etc. Under this H0, the Student's t-test criterion for comparing paired means is:

                                   d̄
Paired t-test : tn – 1 = ––––––– ,        [3]
                              SDd/sqrt(n)

where d̄ is the mean of the differences and SDd is the standard deviation of the differences. The criterion in equation [3] basically is the same as [1] for H0 : μd = μ2 – μ1 = 0. After the differences are obtained, the paired-sample problem reduces to a one-sample problem for these differences with the null hypothesis that the mean difference is zero. In this case also it is necessary that the underlying distribution of differences is approximately Gaussian unless n is large.

Example 3: Consider a prospective study to evaluate the role of intravenous pulse cyclophosphamide (IVCP) infusions in the management of children with steroid-resistant nephrotic syndrome(6). Children were started on monthly infusions of IVCP in a dose of 500-750 mg/m2. Data on serum albumin levels of 14 children before and after IVCP infusions are given in Table I. We assume that the children were randomly selected. The mean of the differences, SD of the differences and the test criterion for paired comparison are as follows.

Mean difference, d̄ = 1.507 and SDd = 0.7468.

Under H0 : μ1 = μ2,

           1.507
t13 = ––––––––––– = 7.55 .
        0.7468/ sqrt(14)

Since it is known in this case that the albumin level after the treatment will increase and cannot decrease, the alternative hypothesis is H1 : μ1 < μ2. For this H1, one-tailed probability is needed. For (n – 1) = 13 df, a statistical software gives P = 0.000 when t = 7.55. If the software is not available, the standard t-table can be used to obtain the P-value. Since t13 = 7.55 is more than the critical value 4.221 for P = 0.001 in Student's t-table at 13 df, the one-tail probability is P < 0.001. Since the probability of Type I error is so small, there is practically no error in rejecting H0. Conclude that the mean albumin level after the treatment is higher than the mean before the treatment.
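The paired computation of Example 3 can be checked directly from the data of Table I; a minimal sketch:

```python
import math
from statistics import mean, stdev

# Serum albumin (g/dl) before and after IVCP infusions (Table I)
pre  = [2.0, 2.5, 1.5, 2.0, 2.3, 2.1, 2.3, 1.0, 2.2, 1.8, 2.0, 2.0, 1.5, 3.4]
post = [3.5, 4.3, 4.0, 4.0, 3.8, 2.4, 3.5, 1.7, 3.8, 3.6, 3.8, 3.8, 4.1, 3.4]

d = [b - a for a, b in zip(pre, post)]      # post - pre differences
n = len(d)
t = mean(d) / (stdev(d) / math.sqrt(n))     # paired t, equation [3]

print(round(mean(d), 3), round(stdev(d), 4), round(t, 2))  # → 1.507 0.7468 7.55
```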

 Paired Samples from Cross-Over Design

Cross-over design is a very efficient strategy for trials on drugs that provide temporary relief. Arthritis, hypertension, asthma and migraine are examples of diseases that are relieved temporarily by the currently available drugs. Now we discuss a method for analysis of quantitative data from cross-over trials. The example given below may help to understand the procedure.

Example 4: Consider a trial on n = 16 asthma patients who were randomly divided into two equal groups of size 8. The first group received treatment-A (say, formoterol) then treatment-B (say, salbutamol) while the second group received treatment-B and then treatment-A. We abbreviate them as trA and trB. An adequate wash-out period was provided before switching the treatment so that there is no carry-over effect. The response variable is forced expiratory volume in one second (FEV1). The data obtained are given in Table II. Different tests of comparison in cross-over design are as follows.

Table I - Levels of Serum Albumin (g/dl) in Pre- and Post-IVCP in 14 Children with Nephrotic Syndrome
Pair No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Pre-IVCP 2.0 2.5 1.5 2.0  2.3 2.1 2.3 1.0 2.2 1.8 2.0 2.0 1.5 3.4
Post-IVCP 3.5 4.3 4.0  4.0 3.8 2.4 3.5 1.7 3.8 3.6 3.8 3.8 4.1 3.4
Diff (di) 1.5 1.8 2.5 2.0 1.5 0.3 1.2 0.7 1.6 1.8 1.8 1.8 2.6 0.0

IVCP: Intravenous pulse cyclophosphamide
Source: Gulati and Kher(6).

 

Table II - Forced Expiratory Volume in One Second (FEV1) Observed for 16 Asthma Patients in Two Treatment Sequences of a Cross-Over Design
 Group I - AB Sequence
 Subject No. 1 2 3 4 5 6 7 8
 FEV1 (1/min)
 Period-1 trA 1.28 1.26 1.60 1.45 1.32 1.20 1.18 1.31
 Period-2 trB 1.25 1.27 1.47 1.38 1.31 1.18 1.20 1.27
 d1 : trA – trB 0.03 –0.01 0.13 0.07 0.01 0.02 –0.02 0.04
 Group II - BA Sequence
 Subject No. 9 10 11 12 13 14 15 16
 FEV1 (1/min)
 Period-1 trB 1.27 1.49 1.05 1.38 1.43 1.31 1.25 1.20
 Period-2 trA 1.30 1.57 1.17 1.36 1.49 1.38 1.45 1.20
 d2 : trA – trB 0.03 0.08 0.12 –0.02 0.06 0.07 0.20 0.00
 trA: treatment A; trB: treatment B.

Test for group effect: Note that in this case the groups identify the sequence, and the group effect is the same as the sequence effect. If the sequence is not affecting the values, the mean difference between trA and trB should be the same in the two groups. Let the differences between trA and trB be d1 and d2 in Group I and Group II, respectively. These are given in Table II. The equality of means of these groups can be tested by the usual two-sample t-test. In this case,

d̄1 = 0.03375, d̄2 = 0.06750; n1 = n2 = 8;

SD12 = 0.0023125, SD22 = 0.0048786;

sp2 = 0.0035955

Thus,

          0.03375 – 0.06750
t14 = –––––––––––––––––––– = –1.13.
        sqrt(0.0035955(1/8+1/8))

From the standard t-table, the critical value at 14 df for α = 0.05 is 2.145. The obtained value |t| = 1.13 is lower. Thus, the difference in means is not statistically significant (P > 0.05). Conclude that the sequence effect is not appreciable. If a sequence effect is present, its reasons should be ascertained and the trial done again after eliminating those causes. In most practical situations where a cross-over trial is used, the sequence of administering the drugs does not make much of a difference. The real possibility, however, is that of a carry-over effect.
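The sequence-effect test above is an ordinary two-sample pooled t-test on the within-subject differences; a minimal sketch using the d1 and d2 values of Table II:

```python
import math
from statistics import mean, variance

# trA - trB differences in the two sequence groups (Table II)
d1 = [0.03, -0.01, 0.13, 0.07, 0.01, 0.02, -0.02, 0.04]  # Group I (AB)
d2 = [0.03, 0.08, 0.12, -0.02, 0.06, 0.07, 0.20, 0.00]   # Group II (BA)

n1, n2 = len(d1), len(d2)
sp2 = ((n1 - 1) * variance(d1) + (n2 - 1) * variance(d2)) / (n1 + n2 - 2)
t = (mean(d1) - mean(d2)) / math.sqrt(sp2 * (1/n1 + 1/n2))

print(round(sp2, 7), round(t, 2))  # → 0.0035955 -1.13
```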

Test for carry-over effect: If a positive carry-over effect is present, the performance of a regimen in period-2 would be better than its performance in period-1. Thus, the test for the presence of a carry-over effect can be performed by comparing the performance of each regimen in the two periods. In our example, this is obtained by comparing trA values in period-1 with trA values in period-2; similarly for trB. Two separate t-tests are required. It is possible that only one of the regimens has a long-term effect so that carry-over is present and the other has no such effect. The two two-sample t-tests would decide whether one or both have a carry-over effect. In our example (Table II), these are as follows:

(i) Mean FEV1 of period-1 with trA of Group I is compared with mean FEV1 of period-2 with trA in Group II.

Treatment A:

Period-1 Period-2

mean = 1.325 mean = 1.365

SD = 0.1386 SD = 0.1387

n1 = n2 = 8; sp2 = 0.019224

Thus,

             1.365 – 1.325
t14 = –––––––––––––––––– = 0.577
        sqrt(0.019224 (1/8+1/8))

In case of a positive carry-over effect, the average in period-2 should be higher than the average in period-1. Thus, the alternative is μd > 0. This tells us that a one-tail P-value is to be obtained. For t = 0.577, P > 0.05. A high P-value says that the equality of means cannot be ruled out. There is no evidence that a carry-over effect is present.

(ii) Mean FEV1 of period-1 with trB of Group I is compared with mean FEV1 of period-2 with trB in Group II.

Treatment B:

Period-1            Period-2

mean = 1.298    mean = 1.291

SD = 0.1391     SD = 0.0952

n1 = n2 = 8;      sp2 = 0.014206

Thus,

            1.291 – 1.298
t14 = ––––––––––––––––––– = – 0.117
        sqrt(0.014206 (1/8 + 1/8))

In this example, again, H1 : μd > 0 where the difference is for (period-2 – period-1). Note that in computing t also, we have used (period-2 – period-1) values in the numerator. A negative value of t shows that the alternative hypothesis is not tenable at all in this case. There is no need to find the P-value.

Test for treatment effect when no carry-over effect is present: The two tests mentioned above are preliminaries. The primary purpose of the trial, of course, is to find whether one treatment is better than the other. This proceeds as follows. We assume that the sequence (or the group) effect is not significant.

Consider the two groups together as one in this case since the sequence is not important and there is no carry-over effect. Calculate the difference trA – trB as shown in Table II and use the paired t-test on the joint sample of 16 subjects by ignoring that they belonged to two groups. The mean and SD of the 16 differences are d̄ = 0.0506 and SDd = 0.06049. Then the test criterion for paired differences is

           0.0506
t15 = –––––––––––– = 3.35 .
          0.06049/ sqrt(16)

From Student's t-table for 15 df, this gives P < 0.01. Thus, the treatment difference is statistically highly significant.
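The treatment-effect test pools all 16 trA – trB differences of Table II into one paired t-test; a minimal sketch:

```python
import math
from statistics import mean, stdev

# All 16 trA - trB differences from Table II (both sequence groups combined)
d = [0.03, -0.01, 0.13, 0.07, 0.01, 0.02, -0.02, 0.04,
     0.03, 0.08, 0.12, -0.02, 0.06, 0.07, 0.20, 0.00]

t = mean(d) / (stdev(d) / math.sqrt(len(d)))  # paired t on 15 df

print(round(mean(d), 4), round(t, 2))  # → 0.0506 3.35
```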

Procedure when carry-over effect is present: Cross-over is not a good strategy when a carry-over effect is present. You may then increase the washout period and ensure that no carry-over effect is present. It is easy to say that a washout period will eliminate the carry-over effect; in fact, it can rarely be dismissed on a priori grounds. A psychological effect may persist even in case of blinding. Thus, a cross-over design should be used only after a fair assurance that the carry-over effect is practically absent. As a side note, we may add that corresponding nonparametric tests are required in place of Student's t when the sample size is small and the pattern is non-Gaussian. These are discussed in the next section.

 10.3 Nonparametric Methods (Two Groups) - Independent and Paired

We continue with the setup where the response is quantitative, practically continuous, and the interest is in comparing two groups. Suppose that a group of only 8 subjects is studied for blood glucose level in diabetic children, which is known to follow a non-Gaussian distribution. The difference from the conventional setup is that n is small and the underlying distribution of the response variable does not follow a Gaussian pattern. This can happen when, for example, studying duration of labor at the time of child birth, blood glucose level in diabetics, hemoglobin level in anemics, etc. The distribution pattern of these variables in such restricted classes of subjects can be highly skewed. It is sometimes suggested that mathematical transformations of values, such as log and square root, should be tried to achieve Gaussianity. We, however, do not like such transformations because they make interpretation difficult and sometimes fail to Gaussianise the pattern. Nonparametric methods, generally equated with distribution-free methods, are needed for this setup. The term nonparametric implies a method that is not for any specific parameter such as the mean. As already stated, these are the methods of choice when the underlying distribution is far from Gaussian and n is small. When the Gaussian conditions are present, their performance is not as good as that of Student's t-test. As in the case of Student's t-test, the nonparametric comparison of two groups is done separately for two types of situations: independent and paired. Among the many nonparametric tests available, we describe two commonly used methods.

 Wilcoxon Rank Sum Test for Independent Samples

Suppose there are n1 observations in the first sample and n2 in the second sample. The Wilcoxon rank-sum test can be used to test whether the locations of the distributions from which these samples have been drawn are the same. The meaning of location should be clear from Fig. 1. Four distributions are shown in this figure with different locations. Distributions B and C are located close to each other whereas A and D are far apart. There is no reference to the mean or any such parameter in this case. The question we are investigating in this section is whether the samples provide sufficient indication that the locations of the two groups are different. Let us explain the method with the help of an example.

Fig. 1. Four non-Gaussian distributions with different locations.

Example 5: Suppose we have data on duration (days) of fall of umbilical cord in newborns delivered by vaginal and Caesarean types as shown in Table III. In this Table, n1 = 11 for vaginal type and n2 = 17 for Caesarean type.

Table III - Hypothetical Data on Duration (in Days) of Fall of Umbilical Cord in Newborns Delivered by Vaginal and Caesarean Types

Group I
Vaginal: 3 5 4 6 7 8 7 9 8 3 8
Rank: 1.5 5.5 3.5 7.5 11 18 11 23.5 18 1.5 18
Rank sum: 119
Group II
Caesarean: 8 9 5 6 7 8 7 8 8 4 8 7 8 26 10 11 17
Rank: 18 23.5 5.5 7.5 11 18 11 18 18 3.5 18 11 18 28 25 26 27
Rank sum: 287

Ranks are obtained after combining Group I and Group II observations.

For the rank-sum test, the two samples are combined and these n (= n1 + n2) observations are assigned ranks from the lowest to the highest. In this example, n = 28. If two or more values are equal (called ties), each of them is assigned the average of the ranks they would have received individually had ties in the data not occurred. For convenience, we assume that the labels are such that n1 <= n2. In our example, the smaller sample is for vaginal deliveries. Now,

Wilcoxon rank-sum criterion :

WR = Sum of the rank assigned to the n1 observations in the first sample (Group I).

For our example, Table III depicts the ranks and rank sums of the vaginal and Caesarean groups. The rank sum of the sample with smaller n (Group I) is 119, i.e., WR = 119. If there is no difference in the location of the two groups then the ranks would be randomly mixed in the groups and the rank sums would be proportionate to the sample sizes. Some deviation can occur due to chance but not much.

Critical values of WR for α = 0.05 under the null hypothesis of no difference between the two groups are available in some text books(7) for one-sided and two-sided H1. Reject H0 if the calculated value of WR is equal to or beyond the critical value for your n1 and n2. No book is needed if you are using a computer package. The package will give you the P-value.

The rank-sum test for two independent samples will not give any statistical significance at 5% level if both n1 and n2 are less than 4. When any of these sample sizes is 10 or more (as in the case of our example), use the following Gaussian approximation:

Z = (WR – μWR) / σWR

where

μWR = n1(n + 1)/2, σWR = sqrt(n1 n2(n + 1) / 12),

n1 <= n2 and n = n1 + n2.

For our example,

μWR = 11 x 29/2 = 159.5 and

σWR = sqrt(11 x 17 x 29/12) = 21.26.

Therefore,

Z = (119 – 159.5) / 21.26 = –1.905.

From a statistical package, the two-tailed P-value for this value of Z is 0.0599. Since this is more than 0.05, the null hypothesis of equality of average duration of fall of the umbilical cord in newborns delivered by vaginal and Caesarean types cannot be rejected. Thus, the evidence in these samples is not sufficiently strong against the assumption that the days taken in fall of the umbilical cord are the same on average in vaginal deliveries and Caesarean deliveries. Note, however, that the P-value in this case is on the margin – very close to 0.05. Thus a caution is required. Perhaps another investigation on a larger sample will provide clinching evidence.

These data are hypothetical. The findings of an original study conducted on a large sample are given by Singh et al.(8), which indicate that the duration of fall is indeed different in the two types of cases.
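The ranking with ties and the Gaussian approximation of Example 5 can be reproduced in a few lines; a sketch in which the helper `avg_ranks` (a name of our own) assigns average ranks to tied values as described above:

```python
import math

def avg_ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j + 2) / 2  # average of ranks i+1 .. j+1
        i = j + 1
    return ranks

# Days to fall of umbilical cord (Table III)
vaginal   = [3, 5, 4, 6, 7, 8, 7, 9, 8, 3, 8]
caesarean = [8, 9, 5, 6, 7, 8, 7, 8, 8, 4, 8, 7, 8, 26, 10, 11, 17]

n1, n2 = len(vaginal), len(caesarean)
n = n1 + n2
ranks = avg_ranks(vaginal + caesarean)
WR = sum(ranks[:n1])                        # rank sum of the smaller group

mu = n1 * (n + 1) / 2                       # mean of WR under H0
sigma = math.sqrt(n1 * n2 * (n + 1) / 12)   # SD of WR under H0
z = (WR - mu) / sigma

print(WR, round(z, 3))  # → 119.0 -1.905
```

The quoted package P-value of 0.0599 likely includes a continuity correction to this Z; without it, the two-tailed P works out to about 0.057, leading to the same conclusion.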

 Wilcoxon Signed-Rank Test for Matched Pairs

The following example describes various steps involved in the procedure of this test.

Example 6: Consider a study similar to the one given in Example 3 on IVCP infusions in the management of children with steroid resistance. The data on blood urea nitrogen (BUN) levels of 9 children before and after IVCP infusions are given in Table IV. The differences in BUN before and after infusions appear to have a skewed distribution and do not follow a Gaussian pattern. The steps of the test are as follows.

Table IV - Computational Steps in Wilcoxon’s Signed Rank Test for the Data on Levels of Blood Urea Nitrogen (BUN) in Pre-and Post-IVCP Infusions for 9 Children with Nephrotic Syndrome
(0) Sl. No. (1) Pre-IVCP BUN (mg/dl) (2) Post-IVCP BUN (mg/dl) (3) Difference (di) = Pre – Post (4) Rank of difference ignoring sign (5) Rank with sign attached
1 16 10 6.0 6.5 6.5
2 21 15 6.0 6.5 6.5
3 10 15 –5.0 4.5 –4.5
4 12 10 2.0 2 2
5 11 16 –5.0 4.5 –4.5
6 9 18 –9.0 8 –8
7 8 10 –2.0 2 –2
8 8 6 2.0 2 2
9 93 15 78.0 9 9

IVCP: Intravenous pulse cyclophosphamide.
Source: Gulati and Kher(6).
Note: Only a part of the non-Gaussian type sample is taken for illustration.

Step 1. Let there be n pairs (n = 9 in our example) and let the observed values for the ith pair be (xi, yi), i = 1,2,....,n. Calculate the difference di = (xi – yi) for each pair. For the purpose of illustration, we consider xi the value before the treatment and yi the value after the treatment. Such differences between pre- and post-infusion BUN are shown in column 3 of Table IV.

Step 2. Ignore the "+" or "–" sign of di and assign ranks starting from 1 for the smallest |di| up to n' for the highest |di|, where n' is the number of pairs with nonzero difference. The pairs with zero difference are omitted. Thus n' <= n. In our example, there are no zero differences and n = n' = 9. Ties, if any, are given the average rank. Column 4 of Table IV gives the ranks after ignoring the sign of the differences.

Step 3. Reaffix the "+" or "–" sign of the difference to the respective ranks. That is, indicate which ranks arose from negative di's and which from positive di's. This is done for our example in column 5 of Table IV.

Step 4. Calculate the Wilcoxon test criterion as the sum of the positive ranks. That is,

Wilcoxon signed rank test:

WS = Sum of the ranks with positive sign

= 26 for BUN example in Table IV.

Step 5. Check the calculated value of WS against the critical values of WS available for right-sided, left-sided and two-sided alternatives in Wilcoxon tables from a statistics book(7) to find whether the P-value is more or less than the significance level such as 0.05. Else, obtain the P-value from a statistical package directly.

Step 6. Reject H0 of no difference between before and after measurements if (i) WS is more than or equal to the right-sided critical value for the case when before measurements are expected to be higher than after measurements, (ii) WS is less than or equal to the left-sided critical value for the case when before measurements are expected to be lower than after measurements, or (iii) WS lies outside the two-sided critical range for the case when before measurements are merely different (can be higher, can be lower) from after measurements.

For our example, the alternative hypothesis is right-sided as we expect the before measurements to be higher than the after measurements. For level 0.05 and n' = 9, the critical value for the right-sided alternative from Wilcoxon’s tables is 37. Our WS value of 26 is less. Thus H0 cannot be rejected. From a statistical package, for the sum of positive ranks WS = 26, the one-tail P-value is 0.375. This is high and again gives the conclusion that the samples do not provide sufficient evidence against the null hypothesis of equality of locations. Conclude that the average BUN levels before and after IVCP infusion cannot be considered different in the population of children from which this sample of 9 was randomly selected.
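The steps above can be sketched in a few lines of Python. The before/after values below are hypothetical illustration data, not the BUN values of Table IV, so the resulting WS differs from the 26 obtained in the example.

```python
# Wilcoxon signed-rank statistic WS, following Steps 1-4 above.
# The before/after values are hypothetical illustration data.

def average_ranks(values):
    """Rank 1 = smallest; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0          # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

before = [52, 48, 60, 55, 49, 63, 58, 50, 57]
after  = [50, 49, 54, 52, 48, 59, 60, 47, 55]

d = [b - a for b, a in zip(before, after)]      # Step 1: paired differences
d = [x for x in d if x != 0]                    # Step 2: drop zero differences
ranks = average_ranks([abs(x) for x in d])      # rank |d|; ties get average rank
ws = sum(r for x, r in zip(d, ranks) if x > 0)  # Steps 3-4: sum of positive ranks
print(ws)  # 39.5 for these data
```

The value of WS is then referred to the Wilcoxon tables (or a package) exactly as in Steps 5 and 6.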

 10.4 Comparison of Means in Three or More Groups

Statistical methods for evaluating the significance of differences among three or more groups remain conceptually simple but become mathematically complex. As always in this series, we avoid complex mathematical expressions and concentrate on explanations that may help in understanding the basic concepts, in being more judicious in choosing an appropriate method for a particular set of data, in realizing the limitations of the methods, and in interpreting the results properly.

Student’s t-test of Sections 10.1 and 10.2 is valid for almost any underlying distribution if n is large but requires a Gaussian pattern if n is small. A similar condition applies to the methods for comparison of means in three or more groups. The test criterion used now is called F. This test also requires that the variance in the different groups is nearly the same. The third, and more important, prerequisite for validity of F is independence of observations. Thus, repeated observations on the same subjects, such as blood pressure monitored after a surgery, cannot be directly subjected to an F-test. They need a separate method, though that too ultimately uses an F-test.

The generic method used for comparing means in three or more groups is called analysis of variance (ANOVA). The name comes from the fact that the total variance in all the groups combined is broken down into components such as within-groups variance and between-groups variance. Between-groups variance is the systematic variation occurring due to group differentials. In Fig. 2, this is the variation among the means of the different groups, shown by circles. The residual left after this extraction is considered a random component arising from intrinsic biologic variability between individuals. In Fig. 2, this is the variation of the diamond-shaped dots from the mean of the respective group. This is the within-groups variance. If genuine group differentials are present, then the between-groups variance should be large relative to the within-groups variance. Thus the ratio of these two components of variance can be used as a criterion to assess whether the group means are different or not. The common setups for such comparisons are one-way ANOVA and two-way ANOVA.

Fig. 2. A scatter diagram depicting group means and overall mean to understand the concept of the components of total variance.

To understand these concepts, consider the setup shown in Table V. The table gives mean birth weights of babies born to mothers with different grades and different durations of smoking. Comparison of mean birth weights across grades of smoking ignoring the duration (last row of the table) is a setup for one-way ANOVA. The comparison of mean birth weights across durations of smoking ignoring the grade of smoking (last column of the table) is also a one-way ANOVA. Consideration of both grade and duration of smoking simultaneously, for comparing their respective group means and investigating the interaction (explained later) if any, is a setup for two-way ANOVA. The two-way setup is more complicated and is not discussed in this article, although a brief description is given later. The tests for both setups can be easily done with the help of a statistical package. The details of the one-way setup for comparison of different groups are given below.

Table V - Mean Birth Weights (kg) by Grade and Duration of Maternal Smoking -
An Example of One-Way and Two-Way ANOVA Setup

Duration of            Grade of maternal smoking
maternal smoking       Mild          Moderate      Heavy         All
– 18 weeks             3.45 (n=15)   3.42 (n=12)   3.43 (n=7)    3.44 (n=34)
18-31 weeks            3.38 (n=8)    3.40 (n=10)   3.39 (n=6)    3.39 (n=24)
32+ weeks              3.35 (n=5)    3.30 (n=3)    3.18 (n=9)    3.25 (n=17)
All                    3.41 (n=28)   3.40 (n=25)   3.32 (n=22)   3.38 (n=75)
 One-Way ANOVA

Suppose we have data on heart rate (per min) during treadmill exercise in children aged 7 to 10 years with mild, moderate and severe grades of anemia and a control group with no anemia. Let the number of children in each group be 12 and let these be randomly chosen from a defined population. The null hypothesis in this case is H0: μ1 = μ2 = .... = μK with K = 4. This says that the mean heart rates in all the grades of anemia are the same. The alternative hypothesis is that at least one mean is different.

Let the sample sizes in the different groups be n1, n2, n3 and n4. In our example, n1 = 12 = n2 = n3 = n4. The sample size for the combined group is n = n1 + n2 + n3 + n4 = 48. Let the mean and standard deviation when all the groups are combined be denoted by x̄ and SD, and for the individual groups by x̄1, SD1; x̄2, SD2; x̄3, SD3 and x̄4, SD4. Suppose, for example, that these in the combined and individual groups are: x̄ = 109.3 and SD = 22.89; x̄1 = 93.86, SD1 = 22.85; x̄2 = 114.6, SD2 = 24.32; x̄3 = 109.8, SD3 = 13.47 and x̄4 = 115.1, SD4 = 24.53.

The first step is to compute what is called the total sum of squares (TSS). This is the numerator of the variance when calculated on the basis of all n observations. The TSS is computed from the deviations of the observations from the overall mean (see Fig. 2). For our example, let TSS = 24635. It is considered to have (n – 1) df. In our example this df is 47.

The sum of squares between groups (SSB) is computed from the deviations of the group means from the overall mean. The mathematical notation is omitted to reduce complexity; however, the findings obtained from a statistical package are reported below to illustrate the concept of the F-test.

In our example, suppose SSB = 3205. For K groups the df for SSB is (K – 1). In this example, the number of groups under comparison is K = 4 and thus df = 3. The last component of the sum of squares is the within-groups sum of squares (SSW). This is obtained by subtracting SSB from TSS. Thus,

SSW = TSS – SSB
    = 24635 – 3205 = 21430.

This is also called the residual or error sum of squares. The df for this residual SS is (n – K). In our example, the df of SSW is 48 – 4 = 44.
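A numerical check of the decomposition TSS = SSB + SSW may help to fix the idea; the three small groups below are hypothetical, not the heart-rate data of the example.

```python
# Verify that the total sum of squares splits exactly into
# between-groups and within-groups components.
groups = [[4, 6, 5], [8, 9, 10], [3, 2, 4]]        # hypothetical data
pooled = [x for g in groups for x in g]
grand = sum(pooled) / len(pooled)                  # overall mean

# TSS: deviations of every observation from the overall mean
tss = sum((x - grand) ** 2 for x in pooled)
# SSB: deviations of each group mean from the overall mean, weighted by size
ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
# SSW: deviations of observations from their own group mean
ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

print(round(tss, 6), round(ssb, 6), round(ssw, 6))  # 62.0 56.0 6.0
assert abs(tss - (ssb + ssw)) < 1e-9                # the identity holds
```

Here most of the variation (56 of 62) lies between groups, so an F ratio for these data would be large.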

The test criterion is

       SSB/(K – 1)       3205/3
F = ––––––––––– = ––––––––– = 2.19 .
       SSW/(n – K)      21430/44

When F ≤ 1, it is a sure indication that the group means can be equal. If the group means are different, the numerator of the F ratio would be large and F would be substantially more than 1. Just like χ2 and t, the distribution of F under H0 is known. The exact shape of F depends on a pair of df, viz., (K – 1) and (n – K), in place of one. The first corresponds to the numerator and the second to the denominator. For our example, these are (3, 44). The table value of F for (3, 44) df at the 5 per cent level of significance is 2.82. The calculated value 2.19 is lower. If we reject the null hypothesis, the chances of error are more than 5 per cent. Thus, we cannot reject the null hypothesis of equality of means in this case. This sample of children does not provide sufficient evidence to conclude that the mean heart rates in the four groups are different.
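The calculation of F is easy to reproduce. The sketch below uses the sums of squares from the example and the 5 per cent table value 2.82 quoted in the text (a package function such as scipy.stats.f.ppf could supply the critical value directly).

```python
# F ratio for the one-way ANOVA example: SSB = 3205, SSW = 21430,
# K = 4 groups, n = 48 observations.
ssb, ssw = 3205.0, 21430.0
k, n = 4, 48

f = (ssb / (k - 1)) / (ssw / (n - k))   # mean square ratio
print(round(f, 2))                      # 2.19

f_crit = 2.82                           # table value for (3, 44) df at 5% level
print(f < f_crit)                       # True: H0 cannot be rejected
```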

In this example, the groups (none, mild, moderate and severe anemia) are graded. The ANOVA procedure ignores this gradation and considers them as distinct groups. Once ANOVA reveals a significant difference, procedures such as multiple comparisons (discussed below) can be used to check whether heart rate is also graded. If the number of groups is large, regression can be used to find the gradient. We discuss this method in the next Article of this series.

If n is small, the ANOVA procedure is valid only when the distribution (of heart rate in our example) is Gaussian. Thus the skewed distributions of the type shown in Fig. 1 are admissible for ANOVA only when n is large.

 Multiple Comparisons

The F-test is able to indicate only whether any mean is different. Once significance is concluded, the next step is to identify the groups that are different from one or more of the others. If there are four groups, the comparisons are group 1 with group 2, 1 with 3, 1 with 4, 2 with 3, 2 with 4 and 3 with 4. These are a total of six comparisons, called multiple comparisons. You now know that means in two groups are generally compared by Student’s t-test. However, repeated application of this test, say, at the 5 per cent level of significance on the same data, blows up the total probability of Type I error to an unacceptable level. If there are 15 such tests, each done at the 5 per cent level, then the overall (experiment-wise) Type I error could be as high as 1 – (1 – 0.05)^15 = 0.54. Compare this with the desired 0.05. To keep the probability of Type I error within a specified limit such as 0.05, many procedures for multiple comparisons are available. Each of these is generally known by the name of the scientist who first proposed it. Among them are the Bonferroni, Tukey, Scheffé, Newman-Keuls, Duncan and Dunnett procedures. The last is used specially when each group is to be compared with the control only. The Bonferroni and Tukey procedures are commonly used in the medical and health literature and are, in our opinion, also the most suitable ones. For details of all these procedures, see Miller(9). We briefly describe the Bonferroni procedure.

Bonferroni Procedure: This is the simplest method to ensure that the probability α of Type I error does not exceed the desired level. Under this procedure, each comparison is done by using Student’s t-test but a difference is considered significant only if the corresponding P-value is less than α/H, where H is the number of comparisons. If there are four groups and all pairwise comparisons are required, then H = 6. Then a difference would be considered significant at the 5 per cent level if P < 0.05/6, i.e., if P < 0.0083, otherwise not.
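Both numbers quoted above, the inflated familywise error and the Bonferroni threshold, can be verified in a few lines:

```python
from math import comb

alpha, k = 0.05, 4
h = comb(k, 2)                      # all pairwise comparisons among 4 groups
print(h, round(alpha / h, 4))       # 6 comparisons, threshold ~0.0083

# Familywise Type I error when 15 tests are each run at level 0.05:
fwer = 1 - (1 - alpha) ** 15
print(round(fwer, 2))               # 0.54, far above the desired 0.05
```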

 Some Comments

We know that ANOVA should not be used when only a small sample is available from a non-Gaussian distribution. The nonparametric equivalent of the one-way F-test is the Kruskal-Wallis test. The details of this procedure are available in Conover(7).
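As a sketch of how the Kruskal-Wallis criterion works, the hand computation below uses three hypothetical groups with no tied values (ties require an averaged-rank correction). For moderate group sizes, H is referred to a chi-square distribution with K – 1 df.

```python
# Kruskal-Wallis H on three hypothetical groups (all values distinct,
# so simple integer ranks suffice).
groups = [[7, 9, 12], [15, 18, 20], [5, 6, 8]]
pooled = sorted(x for g in groups for x in g)
rank = {v: i + 1 for i, v in enumerate(pooled)}   # ranks 1..N over pooled data
n = len(pooled)

# H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), R_i = rank sum of group i
h = 12.0 / (n * (n + 1)) * sum(
    sum(rank[x] for x in g) ** 2 / len(g) for g in groups
) - 3 * (n + 1)
print(round(h, 3))   # 6.489; exceeds the chi-square(2) 5% value 5.99
```

For these data H is significant at the 5 per cent level, reflecting the clearly separated groups.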

We hope that the fundamentals of ANOVA are clear from the details given above. A slightly more complex situation is two-way ANOVA. Consider a clinical trial in which three doses (including a placebo) of a drug are given to a group of male and female subjects to assess the rise in hematocrit (Hct) level. There are two factors in this trial: dose of the drug and gender of the subject. The response of interest is a quantitative variable, namely the percentage rise in Hct level. It is suspected that the effective dose may be different for males than for females. This differential response is called interaction. In this example, the interaction is likely between drug dose and gender. In some cases, evaluation of such interaction between factors could be important to draw valid conclusions. The objective of the trial is to find the effect of dose, of gender and of their interaction on the response. Such a setup with two factors is called a two-way ANOVA. For details, see Lindman(10).

A summary of the different procedures for comparing means or other locations is given in Table VI.

Table VI - Summary of Procedures for Tests of Significance on Means or Other Locations

Setup                               Nature                   Criterion               Section

  A. Comparison of two groups

(i) Independent                     Gaussian                 Student’s t (unpaired)  10.1
                                    Non-Gaussian (small n)   Wilcoxon rank-sum       10.3
(ii) Paired                         Gaussian                 Student’s t (paired)    10.2
                                    Non-Gaussian (small n)   Wilcoxon signed-rank    10.3
(iii) Paired in cross-over design   Gaussian                 Student’s t             10.2
                                    Non-Gaussian (small n)   Wilcoxon signed-rank    10.3

  B. Comparison of three or more groups

(i) One-way ANOVA                                                                    10.4
(ii) Higher-way ANOVA                                        Not discussed
 10.5 When Significant is Not Significant

We started a debate in an earlier Article(3) on the real meaning of statistical significance and discussed how a statistically significant result may not have any medical significance. Now we carry this debate further and elaborate on it in this section.

 The Nature of Statistical Significance

Let us reemphasize that nearly all information in health and medicine is empirical in nature and the samples are a big source of uncertainty by themselves(11). We know that sample size plays a dominant role in statistical inference(3). It was demonstrated earlier that the SE can be substantially reduced by increasing n. Inference from a test of hypothesis can be drawn with minimal chance of error when n is large. However, a side effect of a large n is that a very small difference can become statistically significant. This difference may or may not be medically significant. On the other hand, a large and medically significant difference can fail to be statistically significant when the sample size is small. Besides sample size, which can sometimes cause problems, the level of significance can also create confusion. Statistical significance is said to have been reached when the probability of Type I error is very low. Most use 0.05 as the threshold, but sometimes 0.10 or 0.01 is also used. Whereas P < 0.01 implies P < 0.05, P < 0.10 does not imply P < 0.05. If P = 0.08 then the result would be statistically significant at α = 0.10 but not at α = 0.05. This shows that caution is needed in drawing conclusions from statistical significance.

Statistical significance only means that the probability of no difference in the target population is extremely small. It does not say how much difference is present. If sample size is very large, even a small difference would become statistically significant. Statistical significance without proper medical explanation is rarely useful. However, such an explanation may not be immediately available and may emerge later.

A statistically significant difference is very likely to be real, though there is a small chance that it is not. On the other hand, a real difference may not be statistically significant if n is small. Similarly, a large and medically important difference can fail to be statistically significant if n is not sufficiently large.

 Presence of Medically Important Difference in Means

The null hypotheses discussed so far are of no difference. When this H0 is rejected, the only conclusion reached is that a difference is present. No statement can be made on the magnitude of the difference on the basis of rejecting this H0. The difference could be so small that it has no clinical implication, or could be large enough to be medically important. For example, when can a new hypotensive drug be considered clinically effective – an average decrease in diastolic BP by at least 2 mmHg, 5 mmHg, 8 mmHg or 10 mmHg? Suppose an iron supplementation program for female adolescents is organized. This raises the mean Hb level from 13.6 g/dl to 13.8 g/dl after intake for 30 days. Is this average gain of 0.2 g/dl sufficient to justify a program that entails substantial cost? In all such problems, the medical profession needs to decide the minimum acceptable or tolerable difference that justifies intervention.

A specified difference in means can be detected by taking H0 : μ1 – μ2 = μ0 against, say, H1 : μ1 – μ2 < μ0. This can be tested by using formula [2] of Section 10.1 with a change in the numerator to (x̄1 – x̄2) – μ0 instead of x̄1 – x̄2.
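Formula [2] of Section 10.1 is not reproduced in this part of the Article; the sketch below therefore uses hypothetical summary data and the large-sample SE of the difference in means, with the numerator shifted by the specified difference μ0 as described.

```python
from math import sqrt

# Hypothetical summary statistics (not from the text):
x1, sd1, n1 = 13.8, 1.1, 40     # group 1 mean, SD, size
x2, sd2, n2 = 13.6, 1.2, 40     # group 2 mean, SD, size
mu0 = 0.5                       # specified (medically important) difference, assumed

se = sqrt(sd1**2 / n1 + sd2**2 / n2)   # large-sample SE of the difference
t = ((x1 - x2) - mu0) / se             # numerator shifted by mu0
print(round(t, 2))                     # -1.17
```

The resulting t is referred to the usual t (or, for large n, Gaussian) critical values for the chosen one-sided alternative.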

 Equivalence Tests

The primary aim of equivalence tests is to disprove a null hypothesis that two means, or any other summary measures, differ by a clinically important amount. Thus equivalence tests are designed to demonstrate that no important difference exists between a new and the current regimen. They can also be used to demonstrate stability of a regimen over time, equivalence of two routes of dosage, and equipotency.

Equivalence can be demonstrated either in the form of ‘at least as good as the present standard’ or as ‘neither better nor worse than the present standard’. The former is called clinical equivalence and the latter is called bioequivalence. In the case of clinical equivalence, the alternative hypothesis is one-sided, while in the case of bioequivalence it is two-sided. The latter can help to demonstrate that the dose delivered by the new drug is neither higher nor lower than that delivered by the standard. This is a typical quality-control, equipotency or bioequivalence goal.

An equivalence study may have two independent groups, paired groups or a cross-over design. If the outcome is a quantitative variable, the null hypothesis is H0 : μ1 – μ2 = 0. This can be tested as usual by the methods already described.

 Balancing Type I and Type II Errors

It should now be clear that increasing one type of error decreases the other type and vice versa(3). The two can rarely be simultaneously kept low. A balancing act, of the type that we discussed for sensitivity and specificity(2), is needed.

We explained in our earlier article(3) that Type I error is like a misdiagnosis or like punishing an innocent person and thus is more serious than Type II error. Can Type II error be more hazardous than Type I error? We also mentioned that Type I error in the context of drug trials corresponds to an ineffective drug allowed to be marketed and Type II error corresponds to an effective drug denied entry into the market. In this sense, Type II error may have more serious repercussions if the drug is for a dreaded disease like AIDS or cancer. If a drug can increase the chance of 5-year survival from such a disease by, say, 10%, it may be worth putting it on the market provided the side effects are not serious. Not much Type II error can be tolerated in such cases. This error can be decreased (or power increased) first by increasing the sample size but also by tolerating an increased Type I error. But, if the side effects were serious, a rethinking would be necessary. Thus, the two errors should be balanced keeping in mind their consequences.

  References
  1. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 4. Numerical methods to summarise data. Indian Pediatr 1999; 36: 1127-1134.

  2. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 6. Reference values in medicine and validity of diagnostic tests. Indian Pediatr 2000; 37: 285-291.

  3. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 8. Basic philosophy of statistical tests, confidence intervals and sample size determination. Indian Pediatr 2000; 37: 739-751.

  4. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 9. Statistical inference from qualitative data - Proportions, relative risks and odds ratios. Indian Pediatr 2000; 37: 967-981.

  5. Snedecor GW, Cochran WG. Statistical Methods, 7th edn. Ames, Iowa, Iowa State University Press, 1980.

  6. Gulati S, Kher V. Intravenous pulse cyclophosphamide - A new regime for steroid resistant focal segmental glomerulosclerosis. Indian Pediatr 2000; 37: 141-148.

  7. Conover WJ. Practical Nonparametric Statistics, 3rd edn. New York, John Wiley and Sons, Inc. 1999.

  8. Singh N, Sharma S, Singh R. Umbilical cord fall in preterm and term newborns in vaginal and Caesarean deliveries. Indian Pediatr 1999; 36: 588-590.

  9. Miller RG Jr. Simultaneous Statistical Inference, 2nd edn. New York, Springer-Verlag, 1981.

  10. Lindman HR. Analysis of Variance in Experimental Design. New York, Springer-Verlag, 1992.

  11. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 1. Medical uncertainties. Indian Pediatr 1999; 36: 476-483.

Next Article: Statistical relationships and the concept of multiple regression.
