Multivariate statistics employs vectors of statistics (mean, variance, etc.), which can be considered an extension of the descriptive statistics described in univariate Descriptive Statistics.

**Definition 1**: Given $k$ random variables $x_1, \ldots, x_k$ and a sample of size $n$ for each variable $x_j$ of the form $x_{1j}, \ldots, x_{nj}$, we can define the $k \times 1$ column vector $X$ (also known as a **random vector**) as

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix}$$

(also written more simply as $X = [x_j]$) and then define the **sample mean** (**vector**) of $X$ to be

$$\bar{X} = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_k \end{bmatrix}$$

and similarly for the sample variance, standard deviation and other statistics. Also, if the $\mu_j$ are the population means of the $x_j$, then the **population mean** (**vector**) of $X$ is defined to be

$$\mu = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_k \end{bmatrix}$$

and similarly for population variance, standard deviation, etc. We can also define row vector versions of these.
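As an illustration of Definition 1 outside of Excel, the following is a minimal sketch in Python. The use of NumPy and the small 4 × 3 data set are assumptions made for this example only; they are not part of the original article or its data.

```python
import numpy as np

# Hypothetical sample: n = 4 observations (rows) of k = 3 variables (columns).
data = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 1.0, 7.0],
    [3.0, 5.0, 8.0],
    [4.0, 4.0, 4.0],
])

# Sample mean vector: one mean per variable (i.e. per column).
mean_row = data.mean(axis=0)          # row-vector form, shape (k,)
mean_col = mean_row.reshape(-1, 1)    # column-vector form, shape (k, 1)

# Sample variance and standard deviation vectors.
# ddof=1 divides by n - 1, which gives the sample (not population) statistic.
var_row = data.var(axis=0, ddof=1)
std_row = np.sqrt(var_row)

print("sample mean vector:", mean_row)
print("sample variance vector:", var_row)
print("sample standard deviation vector:", std_row)
```

Each column plays the role of one variable $x_j$, so every statistic is computed column by column and collected into a vector of length $k$.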

**Example 1**: Figure 1 shows the following statistics for each of the EU countries: gross domestic product (GDP) per capita (measured at purchasing power parity, in thousands of US dollars), accumulated public debt (as a percentage of GDP), current annual public deficit (as a percentage of GDP), current annual inflation rate and percentage of the population that is unemployed. Find the sample mean vector.

**Figure 1 – Data for Example 1**

The sample mean row vector (range B32:F32) is [29.8, 61.2, -6.3, 2.1, 9.6], and similarly for variance and standard deviation. We can also look at column vector versions of these statistics; e.g. the sample variance column vector is the 5 × 1 transpose of the sample variance row vector.

**Definition 2**: Given a $k \times 1$ column vector of random variables $X = [x_j]$ and samples of size $n$ for each variable $x_j$ of the form $x_{1j}, \ldots, x_{nj}$, we can define the $k \times k$ **sample variance-covariance matrix** (or simply the **sample covariance matrix**) $S$ as $[s_{ij}]$ where $s_{ij} = \text{cov}(x_i, x_j)$. Since $\text{cov}(x_j, x_j) = \text{var}(x_j) = s_j^2$ and $\text{cov}(x_i, x_j) = \text{cov}(x_j, x_i)$, the covariance matrix is symmetric with the main diagonal consisting of the sample variances.

Similarly, we can define the **population variance-covariance matrix** (or simply the **population covariance matrix**) $\Sigma$ as above where the covariances are population covariances.
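The two properties of the covariance matrix noted in Definition 2 (symmetry, and variances on the main diagonal) can be checked numerically. This is a sketch only, assuming NumPy and a made-up 4 × 3 data set that is not the data of Example 1.

```python
import numpy as np

# Hypothetical sample: n = 4 observations (rows) of k = 3 variables (columns).
data = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 1.0, 7.0],
    [3.0, 5.0, 8.0],
    [4.0, 4.0, 4.0],
])

# rowvar=False tells NumPy that each column is a variable.
# By default np.cov divides by n - 1, giving the sample covariance matrix.
S = np.cov(data, rowvar=False)

# With ddof=0 the divisor is n, i.e. the data are treated as a whole population.
S_pop = np.cov(data, rowvar=False, ddof=0)

# The matrix is symmetric and its diagonal holds the sample variances.
is_symmetric = np.allclose(S, S.T)
diag_is_variance = np.allclose(np.diag(S), data.var(axis=0, ddof=1))

print(S)
print("symmetric:", is_symmetric, "| diagonal = variances:", diag_is_variance)
```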

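The sample and population **correlation matrices** can be defined as $[r_{ij}]$ where

$$r_{ij} = \frac{\text{cov}(x_i, x_j)}{\sqrt{\text{var}(x_i)}\,\sqrt{\text{var}(x_j)}}$$

Since

$$r_{jj} = \frac{\text{cov}(x_j, x_j)}{\sqrt{\text{var}(x_j)}\,\sqrt{\text{var}(x_j)}} = \frac{\text{var}(x_j)}{\text{var}(x_j)} = 1$$

it follows that the main diagonal of this matrix consists only of 1’s.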

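**Observation**: By Property 0 of Least Squares in Multiple Regression, the sample covariance matrix can be expressed by the matrix equation

$$S = \frac{1}{n-1}\,(X - \bar{X})^T (X - \bar{X})$$

where $X$ here denotes the $n \times k$ matrix of sample data $[x_{ij}]$, $\bar{X}$ is the $1 \times k$ row vector of sample means, and $X - \bar{X}$ means that $\bar{X}$ is subtracted from each row of $X$. Also the correlation matrix can be expressed as

$$R = \left[ \frac{s_{ij}}{(D^T D)_{ij}} \right]$$

i.e. $S$ divided element by element by the $k \times k$ matrix $D^T D$, where $D$ = the $1 \times k$ row vector of sample standard deviations.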
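The matrix expressions in this Observation can be sketched in Python as follows. NumPy and the sample data are assumptions for illustration; the comparison against `np.cov` and `np.corrcoef` is included only as a sanity check on the formulas.

```python
import numpy as np

# Hypothetical sample: n = 4 observations (rows) of k = 3 variables (columns).
data = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 1.0, 7.0],
    [3.0, 5.0, 8.0],
    [4.0, 4.0, 4.0],
])
n = data.shape[0]

# 1 x k row vector of sample means; broadcasting subtracts it from every row.
mean_row = data.mean(axis=0, keepdims=True)
centered = data - mean_row

# Sample covariance matrix: (X - Xbar)^T (X - Xbar) / (n - 1).
S = centered.T @ centered / (n - 1)

# 1 x k row vector of sample standard deviations.
D = data.std(axis=0, ddof=1, keepdims=True)

# Correlation matrix: S divided element by element by the k x k matrix D^T D.
R = S / (D.T @ D)

print(S)
print(R)
```

Here `D.T @ D` is a $k \times k$ matrix whose $(i, j)$ entry is the product of the standard deviations of $x_i$ and $x_j$, so the elementwise division reproduces the definition of $r_{ij}$.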

**Example 2**: Calculate the sample covariance and correlation matrices for the data in Example 1.

**Figure 2 – Sample covariance and correlation matrices for Example 2**

Referring to Figures 1 and 2, the sample covariance matrix is constructed by highlighting range H5:L9 (or any other 5 × 5 range) and entering the supplemental array formula =COV(B4:F30) or, optionally, the standard Excel formula

=MMULT(TRANSPOSE(B4:F30-B32:F32),B4:F30-B32:F32)/(COUNTA(A4:A30)-1)

The correlation matrix is constructed by highlighting the range N5:R9 and entering the formula

=COV(B4:F30)/MMULT(TRANSPOSE(B34:F34),B34:F34)

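**Property 1**: If $\lambda_1, \ldots, \lambda_k$ are the eigenvalues of $S$, then

$$\sum_{j=1}^{k} \lambda_j = \sum_{j=1}^{k} s_j^2$$

where $s_j^2$ is the sample variance of $x_j$.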

Proof: By Property 1 of Eigenvalues and Eigenvectors, the trace of *S* equals the sum of the eigenvalues of *S*, but as we observed earlier, the elements on the diagonal of *S* are the variances, and so the sum of these variances is also equal to the trace of *S*.
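Property 1 can be checked numerically on a small example. The sketch below assumes NumPy and uses a made-up data set; `np.linalg.eigvalsh` is used because the covariance matrix is symmetric.

```python
import numpy as np

# Hypothetical sample: n = 4 observations (rows) of k = 3 variables (columns).
data = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 1.0, 7.0],
    [3.0, 5.0, 8.0],
    [4.0, 4.0, 4.0],
])

# Sample covariance matrix (columns are variables).
S = np.cov(data, rowvar=False)

# Eigenvalues of the symmetric matrix S.
eigenvalues = np.linalg.eigvalsh(S)

# Sum of the eigenvalues, trace of S, and sum of the sample variances.
sum_eigenvalues = eigenvalues.sum()
trace_S = np.trace(S)
sum_variances = data.var(axis=0, ddof=1).sum()

print(sum_eigenvalues, trace_S, sum_variances)
```

All three printed quantities should agree up to floating-point rounding.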

Hi,

I’m looking to calculate the mean vector of X and X roof. I read it may have something to do with ANOVA.

Philip,

Is X roof the same thing as X hat (predicted value) or X bar (mean)?

Charles

Dr., good evening. Could you consider implementing Correspondence Analysis in the Real Statistics pack?

Gerardo,

It is on my list of future enhancements.

Charles

Thank you very much

Hello,

I am trying to compare villages that are inside and outside protected areas. I have 30 variables (ordinal) from fieldwork, equally collected for 236 villages (75 inside and 162 outside).

Is there any test to compare these 2 groups, considering 30 variables?

Thank you very much

Sil

Hello Sil,

It sounds like Hotelling’s T-Square test. See the following webpage

http://www.real-statistics.com/multivariate-statistics/hotellings-t-square-statistic/hotellings-t-square-independent-samples/

Charles

Hi,

I am looking for a statistical method that will find significance or a combined score using multiple attributes. The attributes are numerical in nature.

Sorry, but I don’t understand your question.

Charles

Hi

I’m looking for an appropriate test statistic to compare the test results of 4 groups (years) of an intelligence test, with overall results and results by factor (verbal/math/nonverbal). Every year participants of different sex and country took part.

Thanks for help

Manfred,

It does sound like a multivariate test, but I can’t tell which is the appropriate test statistic from the information that you have provided.

Charles

Hi

I need help to analyze a Likert scale (1-7) for a customer survey.

What sort of help do you need?

I calculated the covariance matrix using the formula suggested by you. The values are different from yours (using COV). Also, if I use Excel’s COVAR to populate each cell of the covariance matrix, the values are different.

Can you help me understand why there are these differences?

Excel’s COVAR function calculates the *population* covariance of two sets, while COV calculates a *sample* covariance matrix. You need to use COVP to calculate the population covariance matrix or COVARIANCE.S (in Excel 2010/2013) to calculate a sample covariance. You can also use COVAR(R1,R2)*n/(n-1) to calculate the sample covariance where n = COUNT(R1) = COUNT(R2).

The formula suggested by me only works properly if there is no missing data in any of the cells. If there is some missing data, =COV(R1) will ignore any row which contains missing data. If this isn’t what you want, you may prefer to use the formula =COV(R1,FALSE). See the webpage http://www.real-statistics.com/multiple-regression/least-squares-method-multiple-regression/ for more information about this.
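The n/(n-1) relationship between population and sample covariance mentioned in this reply can be illustrated outside of Excel. The following is a sketch only, assuming NumPy and two made-up data columns; it does not reproduce the COV, COVP or COVAR functions themselves.

```python
import numpy as np

# Two hypothetical data columns of equal length with no missing values.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 5.0, 4.0])
n = len(x)

# Population covariance (divisor n), the quantity a population formula returns.
pop_cov = np.cov(x, y, ddof=0)[0, 1]

# Sample covariance (divisor n - 1), NumPy's default.
sample_cov = np.cov(x, y)[0, 1]

# Rescaling the population covariance by n / (n - 1) gives the sample covariance.
rescaled = pop_cov * n / (n - 1)

print(pop_cov, sample_cov, rescaled)
```

The gap between the two values shrinks as n grows, which is why the difference is most noticeable on small data sets.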

If none of this helps, then, if you like, please send me an Excel worksheet with an example of where the calculations don’t come out correctly. I will look at it and try to figure out what the problem is.

Charles