Resampling procedures are based on the assumption that the underlying population distribution is the same as that of a given sample. The approach is to create a large number of samples from this pseudo-population using the techniques described in Sampling and then draw conclusions from some statistic (mean, median, etc.) of these samples.
Resampling is generally simple to implement and doesn’t require complicated formulas. Unlike parametric techniques, few assumptions are made (e.g. data doesn’t need to be normal and samples don’t necessarily need to be large). Resampling is useful when the population distribution is unknown or other techniques are not available.
We consider two types of resampling procedures: bootstrapping, where sampling is done with replacement, and permutation (also known as randomization tests), where sampling is done without replacement. Generally bootstrapping is used for determining confidence intervals of some parameter, while randomization is used for hypothesis testing.
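The distinction between the two kinds of sampling can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names and the data values are hypothetical and do not come from the worksheets described here.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) values WITH replacement: a value may appear several times."""
    return rng.choices(data, k=len(data))

def permutation_sample(data, rng):
    """Draw len(data) values WITHOUT replacement: a reordering of the data."""
    return rng.sample(data, k=len(data))

rng = random.Random(0)        # fixed seed so the run is repeatable
data = [3, 5, 8, 13, 21]      # hypothetical values, for illustration only

boot = bootstrap_sample(data, rng)
perm = permutation_sample(data, rng)
print("bootstrap  :", boot)
print("permutation:", perm)
```

A bootstrap sample may repeat or omit some of the original values, whereas a permutation sample always contains every original value exactly once, only in a different order.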
One sample case
Suppose that we would like to calculate a confidence interval for the median. Since there are no standard statistical tests for such confidence intervals, we approach the problem via bootstrapping as described in the following example.
Example 1: Calculate a 95% confidence interval around the median for the memory loss program described in Example 1 of the Sign Test, but with the data given in columns A and B of Figure 1.
Figure 1 – Resampling – One sample case
The sample has a mean of 9 and a median of 9.5.
We treat the sample as the population and draw 2,000 samples of size 20 (the same size as the original sample) with replacement. Referring to Figure 1, range D4:W4 represents the first sample, D5:W5 the second, etc. Each element in each sample is selected using the following formula:
=INDEX(B4:B23,RANDBETWEEN(1,20))
We now take the median of each of the 2,000 samples (only the first 21 samples are shown in Figure 1). E.g. cell X4 contains the formula =MEDIAN(D4:W4). Next we plot the distribution of the medians (i.e. range X4:X2003) in a histogram using Excel’s Histogram data analysis tool (or Excel’s charting capability), augmented with percentage and cumulative % columns. The results are shown in Figure 2.
Figure 2 – Analysis for Example 1
The value at the 2.5th percentile is 7 and the value at the 97.5th percentile is 11. Thus we can take the 95% confidence interval to be [7, 11], which contains the sample median of 9.5.
Observation: Instead of using the formula =INDEX(B4:B23,RANDBETWEEN(1,20)), we could use the formula RANDOMIZE(B4:B23) based on the Real Statistics array function RANDOMIZE to select a sample of 20 data elements with replacement.
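For readers working outside Excel, the bootstrap procedure of Example 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name bootstrap_median_ci is hypothetical, the 20 data values are made up for illustration (they are not the values in Figure 1), and the interval is read directly off the sorted bootstrap medians, which is only one of several percentile conventions.

```python
import random
import statistics

def bootstrap_median_ci(sample, n_resamples=2000, confidence=0.95, seed=None):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(sample)
    # Median of each resample drawn with replacement, sorted ascending.
    medians = sorted(
        statistics.median(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    # Positions of the lower and upper percentiles in the sorted list.
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return medians[lo_idx], medians[hi_idx]

# Hypothetical sample of size 20 (NOT the data from Figure 1).
scores = [2, 4, 5, 6, 7, 7, 8, 9, 9, 9, 10, 10, 11, 11, 12, 12, 13, 14, 15, 16]

lower, upper = bootstrap_median_ci(scores, seed=1)
print("sample median:", statistics.median(scores))
print("95% bootstrap CI for the median:", (lower, upper))
```

Because the resamples are random, the endpoints will vary slightly from run to run unless a seed is fixed, just as the Excel worksheet changes each time it recalculates.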
Two independent samples
We now consider the case where we have two independent samples. When the data is normally distributed, we would use the t-test (for independent samples with equal variances or with unequal variances). We can also use the Wilcoxon Rank Sum or Mann-Whitney non-parametric test. We now show how to address such problems using the permutation version of resampling.
Example 2: Using resampling determine whether there is a significant difference between the median life expectancy of smokers and non-smokers using the data described in Figure 3 (this is Example 3 from the Wilcoxon Rank Sum Test).
Figure 3 – Data for Example 2
Note that the median score of the non-smokers is 76.5 while the median score of smokers is 70.5, a difference of 6.
The null hypothesis is that there is no difference between the two groups, i.e.
H0: the median scores for the populations of smokers and non-smokers are the same.
Based on the null hypothesis, we can assume that we have a single population of 78 (represented by the combined sample of smokers and non-smokers). To test the hypothesis we take 2,000 random samples of size 78 from this population without replacement and assume that for each sample the first 40 scores come from the non-smokers and the remaining 38 come from the smokers.
To draw these samples we use the approach described in Sampling, namely we use formulas of the form
=INDEX(J4:CI4,1,RANK(DC6,DC6:GB6))
where the range J4:CI4 contains all 78 data elements in the “population” and DC6:GB6 contains 78 random numbers, generated using RAND(). For each of the 2,000 samples we calculate the median of the non-smokers and smokers and record the difference. A histogram of these median differences is provided in Figure 4.
Figure 4 – Resampling for two independent samples
Now we need to check whether the median difference of the original sample is in the extreme 2.5% of the above data in either tail (2-tail test). From Figure 4, we see that 1.60% of the samples have a median difference of -6 or less and 2.35% of the samples have a median difference of 6 or more, for a total of 3.95%. This means that the probability of getting a sample in either tail based on the null hypothesis is .0395 < .05 = α, and so we reject the null hypothesis and conclude with 95% confidence that there is a significant difference between the life expectancy of smokers and non-smokers.
Observation: If we had used a one tail test, then p-value = .0235 < .05 = α and so we more comfortably reject the null hypothesis.
In the previous example we chose to test the median. Using the same technique, we could have chosen to test the mean instead.
Observation: Instead of using the formula =INDEX(J4:CI4,1,RANK(DC6,DC6:GB6)), we could use the formula =SHUFFLE(J4:CI4) based on the Real Statistics array function SHUFFLE to select a sample from the original 78 data elements without replacement.
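The permutation approach of Example 2 can also be sketched in Python. The function name permutation_test and the two small groups below are hypothetical (they are not the data in Figure 3). The two-tailed p-value is taken as the proportion of shuffled samples whose difference is at least as large in absolute value as the observed difference, which corresponds to adding the two tail percentages as was done above. Passing statistics.mean as the statistic tests the mean instead of the median.

```python
import random
import statistics

def permutation_test(group_a, group_b, statistic=statistics.median,
                     n_resamples=2000, seed=None):
    """Two-tailed permutation test for a difference in a statistic."""
    rng = random.Random(seed)
    observed = statistic(group_a) - statistic(group_b)
    pooled = list(group_a) + list(group_b)   # single population under H0
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                  # sampling without replacement
        # First n_a values play the role of group A, the rest group B.
        diff = statistic(pooled[:n_a]) - statistic(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical life expectancies (NOT the data from Figure 3).
non_smokers = [72, 75, 76, 77, 79, 81, 83, 85]
smokers = [60, 62, 65, 66, 68, 70, 71, 74]

observed, p_value = permutation_test(non_smokers, smokers, seed=1)
print("observed median difference:", observed)
print("two-tailed p-value:", p_value)
```

If the resulting p-value is below the chosen α, the null hypothesis of no difference between the groups is rejected.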
Two matched samples
We now consider the case where we have two matched samples. When the data is normally distributed (or at least symmetric), we would use the Paired Sample t-test. Even for non-normal data we can use the Wilcoxon Signed-Ranks non-parametric test. We now show how to address such problems using resampling techniques.
Example 3: Using resampling, determine whether there is a significant difference between a person’s ability to identify objects with their right eye and with their left eye (this is Example 1 from the Wilcoxon Signed-Ranks Test for Paired Samples).
The null hypothesis is there is no difference between a person’s ability to identify objects with their right eye from their ability with their left eye, i.e. the median difference is zero. As we have seen previously the data is skewed and so it might be better not to use the t-test. We will use resampling and assume that the population is as in the sample.
If the null hypothesis is true then each of the 15 scores for the right eye is just as likely to be larger as smaller than the scores for their left eye, and so we can randomly exchange the scores of each person’s eyes. This is equivalent to randomly changing the sign of the difference between the scores. Thus, we take 2,000 samples each of size 15 (the size of the sample) using the sample data but randomly assigning the sign of the difference as positive or negative (with a 50% probability of each outcome).
This is a form of sampling without replacement. The absolute values of the elements in each sample are as in the population, only the signs vary.
Figure 5 – Resampling for paired samples
Figure 5 shows the first 16 samples (out of 2,000). The range F3:T3 contains the differences of the original data for the first sample. Each of the 15 data elements in the first sample are generated using the formulas
and similarly for the other 1,999 samples. For each sample we calculate the median and create a histogram of the 2,000 median values as shown in Figure 6.
Figure 6 – Analysis for Example 3
The median of the original sample (i.e. the resampling “population”) is MEDIAN(D4:D18) = 3. From Figure 6 we see that 10.00% of all the samples have a median ≤ -3 and 12.30% have a median ≥ 3. Since 10.00% + 12.30% = 22.30% > 5% = α, we cannot reject the null hypothesis, and so we cannot conclude that there is a significant difference between the right and left eye in the population.
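The sign-flipping procedure of Example 3 can be sketched in Python as well. As before, this is an illustration under assumptions: the function name sign_flip_test is hypothetical and the 15 differences are made-up values, not the data in Figure 5. Each resample keeps the absolute values of the differences and assigns each sign at random with probability 50%.

```python
import random
import statistics

def sign_flip_test(differences, n_resamples=2000, seed=None):
    """Two-tailed randomization test that the median paired difference is zero."""
    rng = random.Random(seed)
    observed = statistics.median(differences)
    magnitudes = [abs(d) for d in differences]
    extreme = 0
    for _ in range(n_resamples):
        # Keep each magnitude, choose its sign with a 50% chance either way.
        resample = [m if rng.random() < 0.5 else -m for m in magnitudes]
        if abs(statistics.median(resample)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical right-eye minus left-eye differences (NOT the data from Figure 5).
diffs = [3, -1, 5, 2, -4, 6, 0, 3, -2, 7, 1, -3, 4, 8, 3]

observed, p_value = sign_flip_test(diffs, seed=1)
print("observed median difference:", observed)
print("two-tailed p-value:", p_value)
```

The returned p-value plays the same role as the sum of the two tail percentages read from the histogram: if it is at least α, the null hypothesis is not rejected.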