Resampling Data Analysis Tool

Real Statistics Data Analysis Tool: The Real Statistics Resource Pack provides the Resampling data analysis tool which supports the following tests:

  • One-sample test (on the sample mean, median, 25% trimmed mean or variance)
  • Two paired samples (on the difference between sample means, medians, 25% trimmed means or variances)
  • Two independent samples (on the difference between sample means, medians, 25% trimmed means or variances)
  • One sample correlation test (on the correlation coefficient)
  • ANOVA (on the F statistic)

Bootstrapping can be used for all of these tests, while randomization (permutation) can be used for all but the first.

We describe the first three of these tests in this section; the sections on Correlation and ANOVA resampling are under construction.

We start by repeating Example 1 of Resampling (a one-sample bootstrap on the data in range B3:B23 of Figure 1) using the Resampling data analysis tool. Later we comment more extensively on the tool itself.

To carry out Example 1 press Ctrl-m and double-click on the Resampling data analysis tool from the menu. Next fill in the dialog box that appears as shown in Figure 1 and click on the OK button.

Figure 1 – Resampling dialog box

This time not only do we calculate the 95% confidence interval for the median, but we also calculate the p-value under the null hypothesis that the median is 6. We specify that 10,000 iterations be performed and that a histogram of outcomes be created with minimum and maximum bin values of 4 and 12 and a bin size of 1 unit. The output is shown in Figure 2 (slightly reformatted).

Figure 2 – One sample bootstrap

We see from Figure 2 that the p-value for a two-tailed test is 0.0189 (cell E24) with a 95% confidence interval for the median of (6.5, 11.0). Since the hypothetical median (cell E18) of 6 is outside the confidence interval (and the p-value < α = .05), we reject the null hypothesis and conclude that the population median is significantly different from 6.

We also note that the sample median is 9.5 (cell E21) while the average value of the 10,000 medians is 9.413 (cell H20) with standard deviation 1.413 (cell H21). The p-value for a one-tail test (left tail) is .0188 (cell E22) and the p-value for the right tail is .0001 (cell E23).

Since there are 10,000 iterations and α = .05, the lower end of the confidence interval is the 10,000 · .025 = 250th smallest of the bootstrapped medians, while the upper end of the confidence interval is the 250th largest of the bootstrapped medians. The left p-value is equal to the proportion of bootstrapped medians smaller than 6 (the value of the median under the null hypothesis). Since the sample median minus the hypothetical median = 9.5 – 6.0 = 3.5, the right p-value is equal to the proportion of bootstrapped medians larger than 9.5 + 3.5 = 13.
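The calculation just described can also be sketched outside Excel. The following Python code is only an illustration and is not the Real Statistics implementation: it assumes NumPy, it uses made-up data (the values in range B3:B23 are not reproduced here), and the function name bootstrap_one_sample and its parameters are hypothetical. It takes the 250th smallest and 250th largest bootstrapped medians as the ends of the confidence interval and computes the tail p-values as proportions. Handling the case where the sample statistic lies below the hypothesized value by swapping the two cut-offs is our assumption.

```python
import numpy as np

def bootstrap_one_sample(data, hyp_value, stat=np.median,
                         iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI and tail p-values for a one-sample statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size

    # Draw `iterations` resamples of size n with replacement, one per row.
    idx = rng.integers(0, n, size=(iterations, n))
    boot = np.sort(stat(data[idx], axis=1))

    # k-th smallest and k-th largest bootstrapped statistics (k = 250 here).
    k = max(1, int(round(iterations * alpha / 2)))
    ci = (boot[k - 1], boot[-k])

    # Cut-offs: the hypothesized value and its mirror image about the
    # observed statistic (6 and 9.5 + 3.5 = 13 in the example in the text).
    observed = stat(data)
    mirror = observed + (observed - hyp_value)
    low, high = min(hyp_value, mirror), max(hyp_value, mirror)

    p_left = np.mean(boot < low)     # proportion below the lower cut-off
    p_right = np.mean(boot > high)   # proportion above the upper cut-off
    return {"observed": observed, "ci": ci, "p_left": p_left,
            "p_right": p_right, "p_two_tailed": p_left + p_right}

# Hypothetical sample of 21 values (NOT the data from the worksheet).
sample = [3, 5, 6, 7, 7, 8, 8, 9, 9, 9, 10,
          10, 10, 11, 11, 12, 12, 13, 14, 15, 17]
result = bootstrap_one_sample(sample, hyp_value=6)
print(result)
```

Because the resamples are random, the interval and p-values change slightly with the seed, just as the tool's output changes slightly from run to run.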

Observation: For a One-sample test only Input Range 1 is used. For the other tests the range of the first sample is entered in Input Range 1 and the second is entered in Input Range 2. Alternatively, a two-column range can be entered in Input Range 1 where the first sample is contained in column 1 of the input range, the second sample is contained in column 2 of the input range and Input Range 2 is left empty.

The Hyp Stat Value (hypothesized statistic value) is only used for the One-sample and Correlation tests. For the one-sample test this is the value that you hypothesize for the selected population statistic (mean, median, 25% trimmed mean or variance). For the one-sample correlation test this is the hypothesized value of the population correlation.

The desired number of iterations is entered in the Iteration field. Any value up to 10 million can be selected; the larger this value, the more accurate the result, but the slower the calculation.
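One rough way to quantify this trade-off is the usual binomial approximation: if the true tail proportion is p, a p-value estimated from N resamples has a simulation standard error of about sqrt(p(1 − p)/N). This is a general property of simulated proportions rather than something stated for the Real Statistics tool, and the helper name below is hypothetical.

```python
import math

def p_value_standard_error(p, iterations):
    """Approximate simulation standard error of an estimated p-value."""
    return math.sqrt(p * (1 - p) / iterations)

# Compare a few iteration counts for a p-value near .05.
for n in (1_000, 10_000, 1_000_000):
    print(n, p_value_standard_error(0.05, n))
```

Under this approximation, multiplying the number of iterations by 100 reduces the simulation error by a factor of 10.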

For bootstrapping, a histogram of the selected statistic is displayed using your values for the minimum and maximum bin values and bin size. In addition, the p-value and confidence interval are displayed (as shown in Figure 2). The confidence interval values are displayed only when the number of iterations is 65,000 or less. For randomization, a histogram and p-value are displayed, but not a confidence interval.

Example 1: Perform a two-tailed randomization test for the two independent samples shown in range A3:B14 of Figure 3.

In addition to the two samples, Figure 3 displays the results of the t test and the Shapiro-Wilk test for normality. As we can see, the data is quite normally distributed. The first sample has mean 44.25 while the second sample has mean 29.50, although the two-tailed t test shows no significant difference between the population means (p-value = .07).

Figure 3 – Test of two independent samples

To carry out the randomization test, press Ctrl-m and double-click on the Resampling data analysis tool. When the dialog box shown in Figure 1 appears, fill in Input Range 1 with A3:B14 and leave Input Range 2 empty (or insert A3:A14 in Input Range 1 and B3:B14 in Input Range 2). Make sure that Column headings included with data is unchecked and check the Independent samples, Mean and Randomization checkboxes.

Since we are performing a randomization test, the histogram will be centered around the mean value under the null hypothesis, namely zero. We therefore choose to use minimum and maximum bin values of -22 and 22 and bin size of 2.

Note that since the difference between the means of sample 1 and sample 2 is 44.25 − 29.50 = 14.75, if we had chosen to perform bootstrapping we would have chosen a bin minimum and maximum shifted about 15 units to the right.

The result of the randomization is shown in Figure 4 (slightly reformatted).

Figure 4 – Randomization two sample independence test

The fact that the data is normally distributed is confirmed by the shape of the histogram in Figure 4. Furthermore, the p-value is .0719 (cell L25), which is quite similar to the p-value resulting from the t test (cell AC19 of Figure 3). This is not surprising since the data is quite normally distributed.
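For readers who want to see the mechanics of such a test, here is a minimal Python sketch of a two-sample randomization test on the difference between means. It is not the Real Statistics code. The two samples are hypothetical values chosen only so that their means equal 44.25 and 29.50 (they are not the data in range A3:B14), the function name is made up, and the two-tailed p-value is computed with one common definition (the proportion of shuffled differences at least as large in absolute value as the observed one), which may differ from what the tool does.

```python
import numpy as np

def randomization_two_sample(x, y, iterations=10_000, seed=0):
    """Randomization test on the difference between two sample means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    n1 = x.size                       # the samples may differ in size
    observed = x.mean() - y.mean()

    diffs = np.empty(iterations)
    for i in range(iterations):
        # Shuffle the pooled data and split it into groups of the original sizes.
        shuffled = rng.permutation(pooled)
        diffs[i] = shuffled[:n1].mean() - shuffled[n1:].mean()

    # Assumed definition of the two-tailed p-value (see the note above).
    p_two_tailed = np.mean(np.abs(diffs) >= abs(observed))
    return observed, diffs, p_two_tailed

# Hypothetical samples with means 44.25 and 29.50 (NOT the worksheet data).
x = [12, 25, 31, 38, 40, 44, 47, 52, 55, 58, 63, 66]
y = [8, 14, 19, 22, 26, 28, 30, 33, 37, 41, 45, 51]
observed, diffs, p = randomization_two_sample(x, y)

# Histogram with minimum and maximum bin values of -22 and 22 and bin size 2.
counts, edges = np.histogram(diffs, bins=np.arange(-22, 23, 2))
print(observed, p)
print(counts)
```

The shuffled differences are centered near zero, which is why the bin values are chosen symmetrically around zero.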

Observation: The randomization version of the two paired samples test is as described in Resampling. The bootstrap version is equivalent to a one-sample bootstrap test on the differences between the paired sample elements (e.g. range D3:D18 of Figure 5 for Example 3 in Resampling).
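This equivalence can be illustrated with a short Python sketch, again under stated assumptions: the 16 pairs below are hypothetical (they are not the data of Example 3 in Resampling), the variable names are made up, and only a 95% percentile confidence interval for the mean difference is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired measurements (NOT the worksheet data).
first = np.array([23, 31, 28, 35, 40, 27, 33, 38,
                  29, 36, 42, 30, 34, 26, 39, 32], dtype=float)
second = np.array([20, 29, 30, 31, 36, 25, 30, 37,
                   27, 31, 40, 28, 35, 22, 34, 30], dtype=float)

# Step 1: reduce the two paired samples to one sample of differences.
d = first - second

# Step 2: run an ordinary one-sample bootstrap on the differences.
iterations = 10_000
idx = rng.integers(0, d.size, size=(iterations, d.size))
boot_means = np.sort(d[idx].mean(axis=1))

# 95% percentile interval: 250th smallest and 250th largest bootstrapped means.
k = int(iterations * 0.025)
ci = (boot_means[k - 1], boot_means[-k])
print(d.mean(), ci)
```

If zero lies outside this interval, the paired test rejects the hypothesis of no mean difference at the 5% level.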

2 Responses to Resampling Data Analysis Tool

  1. Jason says:

    When using two independent samples of differing n, all I get is an error stating the “subscript is out of range”
