In order to use MANOVA the following assumptions must be met:
- Observations are randomly and independently sampled from the population
- Each dependent variable has an interval measurement
- Dependent variables are multivariate normally distributed within each group of the independent variables (which are categorical)
- The population covariance matrices of each group are equal (this is an extension of homogeneity of variances required for univariate ANOVA)
These assumptions are similar to those for the Hotelling’s T-square test (see Hotelling’s T-square for Two samples). In particular we test for multivariate normality and homogeneity of covariance matrices in a similar fashion.
Multivariate normality: If the samples are sufficiently large (say at least 20 elements for each dependent × independent variable combination), then the Multivariate Central Limit Theorem holds and we can assume that the multivariate normality assumption is satisfied. If not, we would need to check that the data (or residuals) for each group are multivariate normally distributed. Fortunately, as for the Hotelling’s T-square test, MANOVA is not very sensitive to violations of multivariate normality provided there aren’t any (or at least not many) outliers.
Univariate normality: We start by trying to show that the sample data for each combination of independent and dependent variable is (univariate) normally distributed (or at least symmetric). If there is a problem here, then the multivariate normality assumption may be violated (of course you may find that each variable is normally distributed but the random vectors are not multivariate normally distributed).
For Example 1 of Manova Basic Concepts, for each dependent variable we can use the ExtractCol supplemental function to extract the data for that variable by group and then use the Descriptive Statistics and Normality supplemental data analysis tool contained in the Real Statistics Resource Pack.
E.g. for the Water dependent variable (referring to Figure 1 of Manova Basic Concepts and Figure 3 of Real Statistics Manova Support), highlight the range F5:I13 and enter the array formula =ExtractCol(A3:D35,"water"). Then enter Ctrl-m and select Descriptive Statistics and Normality from the menu. When a dialog box appears, enter F5:I13 in the Input Range and choose the following options: Column headings included with data, Descriptive Statistics, Box Plot and Shapiro-Wilk, and then click on OK. The resulting output is shown in Figure 1.
Figure 1 – Tests for Normality for Water
The descriptive statistics don’t show any extreme values for the kurtosis or skewness. We can see that the box plots are reasonably symmetric and there aren’t any prevalent outliers. Finally the Shapiro-Wilk test shows that none of the samples shows a significant departure from normality.
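For readers who prefer to script these checks rather than run them in Excel, the per-group univariate screening (Shapiro-Wilk, skewness, kurtosis) can be sketched in Python. This is only an illustration: scipy is assumed, and the group names and data values below are simulated stand-ins for the Water data of Example 1.

```python
import numpy as np
from scipy import stats

# simulated stand-ins for the Water data split by the four soil groups
rng = np.random.default_rng(0)
groups = {"loam": rng.normal(10, 2, 8),
          "sandy": rng.normal(11, 2, 8),
          "salty": rng.normal(9, 2, 8),
          "clay": rng.normal(10, 2, 8)}

for name, x in groups.items():
    w, p = stats.shapiro(x)            # Shapiro-Wilk test for normality
    skew = stats.skew(x)               # near 0 for symmetric data
    kurt = stats.kurtosis(x)           # excess kurtosis, near 0 for normal data
    print(f"{name}: W = {w:.3f}, p = {p:.3f}, "
          f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}")
```

A significant Shapiro-Wilk p-value (say below .05) for any group would flag a departure from normality, just as in the Figure 1 output.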
The results are pretty similar for Yield. The results for Herbicide also show no significant departure from normality, but the box plot shows that there may be a potential outlier. The kurtosis value shown in the descriptive statistics for loam is 3.0578, which indicates a potential outlier. We return to this issue shortly.
We can also construct QQ plots for each of the 12 combinations of groups and dependent variables using the QQ Plot data analysis tool provided by the Real Statistics Resource Pack. For example, we generate the Water × Clay QQ plot as follows: press Ctrl-m, select QQ Plot from the menu and then enter F6:F13 (from Figure 1) in the Input Range and click on OK. The chart that results, as displayed in Figure 2, shows a pretty good fit with the normal distribution assumption (i.e. the points lie close to the straight line).
Figure 2 – QQ plot Water × Clay
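The same QQ comparison can be sketched in Python; scipy is assumed, and the sample values below are placeholders for the Water × Clay data in F6:F13.

```python
import numpy as np
from scipy import stats

# placeholder values standing in for the Water x Clay sample (F6:F13)
water_clay = np.array([10.1, 9.4, 11.2, 10.8, 9.9, 10.5, 8.7, 11.0])

# theoretical normal quantiles vs ordered sample values, plus the fitted line
(osm, osr), (slope, intercept, r) = stats.probplot(water_clay, dist="norm")

# r close to 1 means the points lie close to the straight line,
# i.e. a good fit with the normal distribution assumption
print(f"r = {r:.3f}")
```

Passing a matplotlib axes via the `plot=` argument of `probplot` draws the chart directly instead of returning the fitted values.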
Multivariate normality: It is very difficult to show multivariate normality. One indicator is to construct scatter plots for the sample data for each pair of dependent variables. If the distribution is multivariate normal, the cross sections in two dimensions should be in the form of an ellipse (or straight line in the extreme case). E.g. for Yield × Water, highlight the range B4:C35 and then select Insert > Charts|Scatter. The resulting chart is shown in Figure 3.
Figure 3 – Scatter plot for Yield × Water
To produce the scatter plot for Water × Herbicide, similarly highlight the range C4:D35 and select Insert > Charts|Scatter. The case of Yield × Herbicide is a bit more complicated: highlight the range B4:D35 (i.e. all three columns of data) and select Insert > Charts|Scatter as before. The resulting chart is shown in Figure 4.
Figure 4 – Scatter plots for Yield × Water and Yield × Herbicide
Series 1 (in blue) represents Yield × Water and series 2 (in red) represents Yield × Herbicide. Click on any of the points in series 1 and hit the Delete (or Backspace) key. This erases the blue series and only the desired red series remains. Adding the title and removing the legend produces the scatter chart in Figure 5.
Figure 5 – Scatter plot for Yield × Herbicide
All three scatter plots are reasonably elliptical, supporting the case for multivariate normality.
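Outside of Excel, the same pairwise scatter plots can be produced in a few lines of Python. matplotlib is assumed, and the data are simulated stand-ins for the Yield, Water and Herbicide columns; under multivariate normality each point cloud should look roughly elliptical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                # render to file without a display
import matplotlib.pyplot as plt

# simulated stand-ins for the three dependent variables (illustrative values)
rng = np.random.default_rng(1)
mean = [65, 10, 5]
cov = [[25, 4, 1], [4, 4, 0.5], [1, 0.5, 1]]
data = rng.multivariate_normal(mean, cov, size=32)
names = ["Yield", "Water", "Herbicide"]

# one scatter plot per pair of dependent variables
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (i, j) in zip(axes, [(1, 0), (2, 0), (2, 1)]):
    ax.scatter(data[:, i], data[:, j])
    ax.set_xlabel(names[i])
    ax.set_ylabel(names[j])
    ax.set_title(f"{names[j]} × {names[i]}")
fig.tight_layout()
fig.savefig("pairwise_scatter.png")
```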
Outliers: As mentioned above, the multivariate normality assumption is sensitive to the presence of outliers. Here we need to be concerned with both univariate and multivariate outliers. If outliers are detected they can be dealt with in a fashion similar to the univariate case.
Univariate outliers: For the univariate case, generally we need to look at data elements with a z-score of more than 3 or less than -3 (or 2.5 for smaller samples, say less than 80 elements). For data that is normally distributed (which, of course, we are assuming is true of our data), the probability of a z-score of more than +3 or less than -3 is 2*(1–NORMSDIST(3)) = 0.0027 (i.e. about 1 in 370). The probability of a z-score of more than 2.5 or less than -2.5 is 0.0124 (i.e. about 1 in 80).
The figures 2.5 and 3.0 are somewhat arbitrary, and different cutoffs can be used instead. In any case, even if a data element can be classified as a potential outlier based on this criterion, it doesn’t mean that it should be thrown away. The data element may be perfectly reasonable (e.g. in a sample of say 1,000 elements, you would expect at least one potential outlier 1 − (1 − .0027)^1000 = 93.3% of the time).
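The z-score screen described above can be sketched as follows (numpy and scipy assumed); the sample values are illustrative, with one suspiciously large value appended.

```python
import numpy as np
from scipy.stats import norm

# 20 illustrative values, the last one deliberately suspect
rng = np.random.default_rng(0)
x = np.append(rng.normal(5, 0.5, 19), 12.0)

z = (x - x.mean()) / x.std(ddof=1)    # z-scores using the sample std dev
outliers = x[np.abs(z) > 2.5]         # 2.5 cutoff for a smaller sample
print(outliers)

# the tail probabilities quoted in the text
p3 = 2 * (1 - norm.cdf(3))            # about 0.0027, i.e. roughly 1 in 370
p25 = 2 * (1 - norm.cdf(2.5))         # about 0.0124, i.e. roughly 1 in 80
```

Note that in very small samples the largest attainable |z| is bounded (at most (n − 1)/√n), so a 2.5 cutoff can never fire for, say, n = 8; this is one reason the cutoff choice matters.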
Since we suspect that there is an outlier in the herbicide sample, we will concentrate on the data in that sample. We first use the supplemental array formula =ExtractCol(A3:D35,"herbicide") to extract the herbicide data. We then look at the box plot and investigate the outliers for that sample using Real Statistics’ Descriptive Statistics and Normality data analysis tool as follows: enter Ctrl-m, select Descriptive Statistics and Normality from the menu, enter F33:I41 in the Input Range and choose the Column headings included with data, Box Plot and Outliers and Missing Data options. The output is shown in Figure 6.
Figure 6 – Investigation of potential outliers in Herbicide data
As mentioned previously, the Box Plot (see Figure 6) for herbicide in Example 1 of Manova Basic Concepts indicates a potential outlier, namely the data element in cell G38. The z-score for this entry is given by the formula (cell S13).
STANDARDIZE(G38,AVERAGE(G34:G41),STDEV(G34:G41)) = 2.16
This value is still less than 2.5, and so we aren’t too concerned. In fact the report shows there are no potential outliers.
Multivariate outliers: Multivariate outliers are harder to spot graphically, and so we test for these using the Mahalanobis distance squared. For any data sample X with k dependent variables (here X is a k × n matrix) with covariance matrix S, the Mahalanobis distance squared, D2, of any k × 1 column vector Y from the mean vector μ of X (i.e. the center of the hyper-ellipse) is given by

D² = (Y − μ)ᵀ S⁻¹ (Y − μ)
Since the data in standard format is represented by an n × k matrix, we look at the row equivalent version of the above formula, namely, for any data sample X with k dependent variables with covariance matrix S, the Mahalanobis distance squared, D2, of any 1 × k row vector Y is given by

D² = (Y − μ) S⁻¹ (Y − μ)ᵀ
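The row-vector formula can be evaluated for every row of the sample at once; below is a minimal numpy sketch, using simulated stand-in data for the n × k sample.

```python
import numpy as np

# simulated stand-in for the n x k sample (n = 32 rows, k = 3 variables)
rng = np.random.default_rng(2)
X = rng.multivariate_normal([65, 10, 5],
                            [[25, 4, 1], [4, 4, 0.5], [1, 0.5, 1]], size=32)

mu = X.mean(axis=0)                    # 1 x k mean vector
S = np.cov(X, rowvar=False)            # k x k sample covariance matrix
S_inv = np.linalg.inv(S)

dev = X - mu                           # n x k matrix of deviations from the mean
# D² = (Y − μ) S⁻¹ (Y − μ)ᵀ evaluated for every row Y of X
d2 = np.einsum("ij,jk,ik->i", dev, S_inv, dev)
```

A handy sanity check: with the sample covariance (denominator n − 1) these D² values always sum to (n − 1)k.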
To check for outliers we calculate D2 for all the row vectors in the sample. This can be done using the Real Statistics MANOVA data analysis tool, this time choosing the Outliers option (see Figure 1 of Real Statistics Manova Support). The output is displayed in Figure 7.
Figure 7 – Using Mahalanobis D2 to identify outliers
Here the covariance matrix for the sample data (range I4:K6) is calculated by the supplemental array formula =COV(B4:D35), using the supplemental function COV. The inverse of the covariance matrix (range I9:K11) is then calculated by the array formula MINVERSE(I4:K6).
The values of D2 can now be calculated as described above. E.g. D2 for the first sample element (cell F4) is calculated by applying the above formula to the first row of data, using the inverse covariance matrix in range I9:K11.
The values of D2 play the same role as the z-scores in identifying multivariate outliers. Since the original data is presumed to be multivariate normal, by Property 3 of Multivariate Normal Distribution Basic Concepts, the distribution of the values of D2 is chi-square with k (= the number of dependent variables) degrees of freedom. Usually any data element whose p-value is < .001 is considered to be a potential outlier. As in the univariate case this cutoff is somewhat arbitrary.
For Example 1 of Manova Basic Concepts, the p-values are displayed in column G of Figure 7. E.g. the p-value of the first sample element (cell G4) is calculated by the formula =CHISQ.DIST.RT(F4,3), i.e. the right tail of the chi-square distribution with 3 degrees of freedom evaluated at the D2 value in cell F4.
Any element that is a potential outlier is indicated by an asterisk in column H. We note that none of the p-values in column G is less than .001 and so there are no potential multivariate outliers.
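The flagging step can be sketched in Python (scipy assumed); the D² values below are illustrative, with one artificially extreme entry included to show a flagged row.

```python
import numpy as np
from scipy.stats import chi2

k = 3                                   # number of dependent variables
# illustrative D² values, one of them artificially extreme
d2 = np.array([1.8, 0.9, 4.2, 2.7, 96.76, 3.1])

p = chi2.sf(d2, df=k)                   # right-tail chi-square p-values
flags = p < 0.001                       # potential multivariate outliers
for d, pv, f in zip(d2, p, flags):
    print(f"D2 = {d:7.2f}  p = {pv:.3g}{'  *' if f else ''}")
```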
If the yield value for the first sample element (cell B4) is changed from 76.7 to 176.7, then the D2 value in cell F4 would change to 96.76 and the p-value (cell G4) would become 7.71E-21, which is far below .001. While the value 176.7 might be correct, it would be so much higher than the other yields obtained that we would probably suspect that it was a typing mistake and check to see that the correct value is 76.7.
Real Statistics Function: The following function is supplied by the Real Statistics Resource Pack:
MDistSq(R1,R2): the Mahalanobis distance squared between the 1 × k row vector R2 and the mean vector of the sample contained in n × k range R1
For example, the Mahalanobis distance squared between the row vector R2 = (50, 25, 5) and the mean of the sample R1 = B4:D35 is MDistSq(R1, R2) = 1.072043.
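A Python analogue of this function is easy to write from the description above; `mdist_sq` is a hypothetical helper built with numpy, not the Real Statistics implementation.

```python
import numpy as np

def mdist_sq(r1, r2):
    """Mahalanobis distance squared of the 1 x k row vector r2
    from the mean vector of the n x k sample r1."""
    X = np.asarray(r1, dtype=float)
    y = np.asarray(r2, dtype=float)
    dev = y - X.mean(axis=0)                        # 1 x k deviation vector
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix
    return float(dev @ S_inv @ dev)
```

By construction, the distance of the sample mean vector from itself is 0.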
Homogeneity of covariance matrices: As for the Hotelling’s T-square test, MANOVA is not so sensitive to violations of this assumption provided the covariance matrices are not too different and the sample sizes are equal.
If the sample sizes are unequal (generally if the largest sample is more than 50% bigger than the smallest), Box’s Test can be used to test for homogeneity of covariance matrices (see Box’s Test). This is an extension of Bartlett’s Test as described in Homogeneity of Variances. As mentioned there, caution should be exercised and many would recommend not using this test since Box’s Test is very sensitive to violations of multivariate normality.
If the larger samples also have larger variances, then the MANOVA test tends to be robust for type I errors (with a loss in power). If the smaller samples have larger variances, then you should have more confidence when retaining the null hypothesis than when rejecting it. You should also use a more stringent test (Pillai’s trace instead of Wilks’ lambda).
Since the sample sizes for Example 1 of Manova Basic Concepts are equal, we probably don’t need to use the Box Test, but we could perform the test using the Real Statistics MANOVA data analysis tool, this time choosing the Box Test option (see Figure 1 of Real Statistics Manova Support). The output is shown in Figure 8.
Figure 8 – Box’s Test
Since the p-value for the Box Test is .715, which is far higher than the commonly used value of α = .001, we conclude there is no evidence that the covariance matrices are significantly unequal.
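For completeness, here is a sketch of Box’s M statistic and its standard chi-square approximation in Python. numpy and scipy are assumed; this illustrates the textbook formulas rather than reproducing the Real Statistics implementation.

```python
import numpy as np
from scipy.stats import chi2

def box_m(groups):
    """Box's M test for equality of covariance matrices.
    groups: list of n_i x k data arrays. Returns (chi2 stat, p-value)."""
    k = groups[0].shape[1]
    g = len(groups)
    ns = np.array([x.shape[0] for x in groups])
    covs = [np.cov(x, rowvar=False) for x in groups]
    # pooled covariance matrix
    S_p = sum((n - 1) * S for n, S in zip(ns, covs)) / (ns.sum() - g)
    # Box's M statistic: log-determinant comparison of pooled vs group matrices
    M = (ns.sum() - g) * np.log(np.linalg.det(S_p)) \
        - sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    # correction factor and chi-square approximation
    c = (sum(1 / (n - 1) for n in ns) - 1 / (ns.sum() - g)) \
        * (2 * k**2 + 3 * k - 1) / (6 * (k + 1) * (g - 1))
    df = k * (k + 1) * (g - 1) / 2
    stat = M * (1 - c)
    return stat, chi2.sf(stat, df)
```

When the group covariance matrices are identical, M = 0 and the p-value is 1, which is a useful check on the implementation.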
Collinearity: MANOVA extends ANOVA when multiple dependent variables need to be analyzed. It is especially useful when these dependent variables are correlated, but it is also important that the correlations not be too high (i.e. greater than .9 in absolute value) since, as in the univariate case, collinearity results in instability of the model.
The correlation matrix for the data in Example 1 of Manova Basic Concepts is given in range R29:T31 of Figure 2 of Real Statistics Manova Support. We see that none of the off-diagonal values are greater than .9 (or less than -.9) and so we don’t have any problems with collinearity.
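The collinearity screen amounts to checking the off-diagonal entries of the correlation matrix; a numpy sketch with simulated stand-in data:

```python
import numpy as np

# simulated stand-ins for the Yield, Water and Herbicide columns
rng = np.random.default_rng(3)
data = rng.multivariate_normal([65, 10, 5],
                               [[25, 4, 1], [4, 4, 0.5], [1, 0.5, 1]], size=32)

R = np.corrcoef(data, rowvar=False)           # 3 x 3 correlation matrix
off_diag = np.abs(R[~np.eye(3, dtype=bool)])  # off-diagonal |r| values
print("max |r| =", off_diag.max())            # values above .9 signal collinearity
```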