# Principal Component Analysis

Principal component analysis is a statistical technique that is used to analyze the interrelationships among a large number of variables and to explain these variables in terms of a smaller number of variables, called principal components, with a minimum loss of information.

Definition 1: Let X = [xi] be any k × 1 random vector. We now define a k × 1 vector Y = [yi], where for each i the ith principal component of X is

yi = $\sum_{j=1}^k \beta_{ij} x_j$

for some regression coefficients βij. Since each yi is a linear combination of the xj, Y is a random vector.

Now define the k × k coefficient matrix β = [βij] whose rows are the 1 × k vectors $\beta_i^T$ = [βij]. Thus,

yi = $\beta_i^T X$                     Y = $\beta X$

For reasons that will become apparent shortly, we choose to view the rows of β as column vectors βi, and so the rows themselves are the transposes $\beta_i^T$.

Observation: Let Σ = [σij] be the k × k population covariance matrix for X. Then the covariance matrix for Y is given by

$\Sigma_Y = \beta \Sigma \beta^T$

i.e. the population variances and covariances of the yi are given by

var(yi) = $\beta_i^T \Sigma \beta_i$                     cov(yi, yj) = $\beta_i^T \Sigma \beta_j$

Observation: Our objective is to choose values for the regression coefficients βij so as to maximize var(yi) subject to the constraint that cov(yi, yj) = 0 for all i ≠ j. We find such coefficients βij using the Spectral Decomposition Theorem (Theorem 1 of Linear Algebra Background). Since the covariance matrix is symmetric, by Theorem 1 of Symmetric Matrices, it follows that

$\Sigma = \beta D \beta^T$

where β is a k × k matrix whose columns are the unit eigenvectors β1, …, βk corresponding to the eigenvalues λ1, …, λk of Σ, and D is the k × k diagonal matrix whose main diagonal consists of λ1, …, λk. Alternatively, the spectral theorem can be expressed as

$\Sigma = \sum_{j=1}^k \lambda_j \beta_j \beta_j^T$

Property 1: If λ1 ≥ … ≥ λk are the eigenvalues of Σ with corresponding unit eigenvectors β1, …, βk, then

yi = $\beta_i^T X$
and furthermore, for all i and j ≠ i

var(yi) = λi         cov(yi, yj) = 0

Proof: The first statement results from Theorem 1 of Symmetric Matrices as explained above. Since the column vectors βj are orthonormal, βi · βj = $\beta_i^T \beta_j$ = 0 if j ≠ i and $\beta_i^T \beta_j$ = 1 if j = i. Thus

cov(yi, yj) = $\beta_i^T \Sigma \beta_j = \beta_i^T (\lambda_j \beta_j) = \lambda_j \beta_i^T \beta_j$

which equals λi when j = i and 0 when j ≠ i.
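The diagonalization in Property 1 is easy to check numerically. A minimal sketch (the matrix below is randomly generated, not taken from the example):

```python
import numpy as np

# Build an arbitrary symmetric positive semi-definite matrix to stand in
# for a covariance matrix Sigma (illustrative data, not from the example).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
Sigma = A.T @ A

# Columns of beta are unit eigenvectors; lam holds the eigenvalues.
lam, beta = np.linalg.eigh(Sigma)

# Covariance matrix of Y: the off-diagonal entries (the cov(yi, yj)) vanish
# and the diagonal (the var(yi)) recovers the eigenvalues.
Sigma_Y = beta.T @ Sigma @ beta
assert np.allclose(Sigma_Y, np.diag(lam))
```

Note that `np.linalg.eigh` returns the eigenvalues in ascending order, so for PCA they are usually re-sorted in descending order.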

Property 2: $\sum_{i=1}^k \sigma_i^2 = \sum_{i=1}^k \lambda_i$

Proof: By definition of the covariance matrix, the main diagonal of Σ contains the values $\sigma_1^2$, …, $\sigma_k^2$, and so trace(Σ) = $\sum_{i=1}^k \sigma_i^2$. But by Property 1 of Eigenvalues and Eigenvectors, trace(Σ) = $\sum_{i=1}^k \lambda_i$.

Observation: Thus the total variance $\sum_{i=1}^k \sigma_i^2$ for X can be expressed as trace(Σ) = $\sum_{i=1}^k \lambda_i$, but by Property 1, this is also the total variance for Y.

Thus the portion of the total variance (of X or Y) explained by the ith principal component yi is λi/$\sum_{i=1}^k \lambda_i$. Assuming that λ1 ≥ … ≥ λk, the portion of the total variance explained by the first m principal components is therefore $\sum_{i=1}^m \lambda_i$ / $\sum_{i=1}^k \lambda_i$.

Our goal is to find a reduced number of principal components that can explain most of the total variance, i.e. we seek a value of m that is as low as possible but such that the ratio $\sum_{i=1}^m \lambda_i$  / $\sum_{i=1}^k \lambda_i$ is close to 1.
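This selection rule can be sketched in a few lines of Python (numpy assumed; the eigenvalues are the rounded values that appear later in Figure 5):

```python
import numpy as np

# Eigenvalues of the correlation matrix, rounded from Figure 5.
lam = np.array([2.88, 1.43, 1.16, 1.02, 0.705, 0.647, 0.56, 0.34, 0.23])
lam = np.sort(lam)[::-1]          # ensure lambda_1 >= ... >= lambda_k

share = lam / lam.sum()           # portion of total variance per component
cumulative = np.cumsum(share)

# Smallest m whose components explain at least 70% of the total variance.
m = int(np.argmax(cumulative >= 0.70)) + 1
print(m, round(cumulative[m - 1], 3))   # 4 components explain ~72.3%
```

With these eigenvalues, m = 4 components already account for roughly 72% of the total variance, matching the worked example below.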

Observation: Since the population covariance matrix Σ is unknown, we will use the sample covariance matrix S as an estimate and proceed as above using S in place of Σ. Recall that S is given by the formula:

$S = \frac{1}{n-1} \sum_{j=1}^n (X_j - \bar{X})(X_j - \bar{X})^T$

where we now consider X = [xij] to be a k × n matrix such that for each i, {xij: 1 ≤ j ≤ n} is a random sample for the random variable xi. Since the sample covariance matrix is symmetric, there is a similar spectral decomposition

$S = \sum_{j=1}^k \lambda_j B_j B_j^T$

where the Bj = [bij] are the unit eigenvectors of S corresponding to the eigenvalues λj of S (actually this is a bit of an abuse of notation since these λj are not the same as the eigenvalues of Σ).

We now use the bij as the regression coefficients and so have

yi = $B_i^T X$

and as above, for all i and j ≠ i

var(yi) = λi         cov(yi, yj) = 0

As before, assuming that λ1 ≥ … ≥ λk, we want to find a value of m so that $\sum_{i=1}^m \lambda_i$ explains as much of the total variance as possible. In this way we reduce the number of principal components needed to explain most of the variance.

Example 1: The school system of a major city wanted to determine the characteristics of a great teacher, and so they asked 120 students to rate the importance of each of the following 9 criteria using a Likert scale of 1 to 10 with 10 representing that a particular characteristic is extremely important and 1 representing that the characteristic is not important.

1. Setting high expectations for the students
2. Entertaining
3. Able to communicate effectively
4. Having expertise in their subject
5. Able to motivate
6. Caring
7. Charismatic
8. Having a passion for teaching
9. Friendly and easy-going

Figure 1 shows the scores from the first 10 students in the sample and Figure 2 shows some descriptive statistics about the entire 120 person sample.

Figure 1 – Teacher evaluation scores

Figure 2 – Descriptive statistics for teacher evaluations

The sample covariance matrix S is shown in Figure 3 and can be calculated directly as

=MMULT(TRANSPOSE(B4:J123-B126:J126),B4:J123-B126:J126)/(COUNT(B4:B123)-1)

Here B4:J123 is the range containing all the evaluation scores and B126:J126 is the range containing the means for each criterion. Alternatively we can simply use the Real Statistics formula COV(B4:J123) to produce the same result.
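The same centered cross-product that the MMULT formula computes can be sketched in Python (numpy assumed; the scores below are randomly generated stand-ins, not the actual survey data):

```python
import numpy as np

# Random stand-in for the 120 x 9 matrix of evaluation scores (B4:J123).
rng = np.random.default_rng(1)
X = rng.integers(1, 11, size=(120, 9)).astype(float)

# Subtract the column means (the B126:J126 step) and form the cross-product,
# dividing by n - 1, exactly as the Excel MMULT formula does.
centered = X - X.mean(axis=0)
S = centered.T @ centered / (X.shape[0] - 1)

# Agrees with numpy's built-in sample covariance.
assert np.allclose(S, np.cov(X, rowvar=False))
```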

Figure 3 – Covariance Matrix

In practice, we usually prefer to standardize the sample scores, which gives the nine criteria equal weight. This is equivalent to using the correlation matrix. Let R = [rij] where rij is the correlation between xi and xj, i.e.

$r_{ij} = \dfrac{s_{ij}}{s_i s_j}$

where sij is the sample covariance between xi and xj and si, sj are their sample standard deviations.

The sample correlation matrix R is shown in Figure 4 and can be calculated directly as

=MMULT(TRANSPOSE((B4:J123-B126:J126)/B127:J127),(B4:J123-B126:J126)/B127:J127)/(COUNT(B4:B123)-1)

Here B127:J127 is the range containing the standard deviations for each criterion. Alternatively we can simply use the Real Statistics function CORR(B4:J123) to produce the same result.

Figure 4 – Correlation Matrix

Note that all the values on the main diagonal are 1, as we would expect since the variances have been standardized. We next calculate the eigenvalues and eigenvectors of the correlation matrix using the Real Statistics eVECTORS(M4:U12) formula, as described in Linear Algebra Background. The result appears in range M18:U27 of Figure 5.

Figure 5 – Eigenvalues and eigenvectors of the correlation matrix

The first row in Figure 5 contains the eigenvalues for the correlation matrix in Figure 4. Below each eigenvalue is a corresponding unit eigenvector. E.g. the largest eigenvalue is λ1 = 2.880437. Corresponding to this eigenvalue is the 9 × 1 column eigenvector B1 whose elements are 0.108673, -0.41156, etc.

As we described above, the coefficients of the eigenvectors serve as the regression coefficients of the 9 principal components. For example, the first principal component can be expressed by

y1 = $B_1^T X$

i.e.

y1 = 0.108673x1 − 0.41156x2 + ⋯

Thus for any set of scores (for the xj) you can calculate each of the corresponding principal components. Keep in mind that you need to standardize the values of the xj first, since this is how the correlation matrix was obtained. For the first sample (row 4 of Figure 1), we can calculate the nine principal components using the matrix equation Y = $B^T X'$ as shown in Figure 6.

Figure 6 – Calculation of PC1 for first sample

Here B (range AI61:AQ69) is the set of eigenvectors from Figure 5, X (range AS61:AS69) is simply the transpose of row 4 from Figure 1, X′ (range AU61:AU69) standardizes the scores in X (e.g. cell AU61 contains the formula =STANDARDIZE(AS61, B126, B127), referring to Figure 2) and Y (range AW61:AW69) is calculated by the formula =MMULT(TRANSPOSE(AI61:AQ69),AU61:AU69). Thus the principal components values corresponding to the first sample are 0.782502 (PC1), -1.9758 (PC2), etc.
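The calculation in Figure 6 can be sketched as follows (numpy assumed; the scores are randomly generated stand-ins for the actual data):

```python
import numpy as np

# Random stand-in for the 120 x 9 score matrix.
rng = np.random.default_rng(2)
data = rng.integers(1, 11, size=(120, 9)).astype(float)

# Correlation matrix (Figure 4) and its unit eigenvectors (Figure 5).
R = np.corrcoef(data, rowvar=False)
lam, B = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]          # sort so lambda_1 >= ... >= lambda_k
lam, B = lam[order], B[:, order]

# Standardize the first sample (the STANDARDIZE step), then project.
x = data[0]
x_std = (x - data.mean(axis=0)) / data.std(axis=0, ddof=1)
Y = B.T @ x_std                        # the nine principal component scores
```

As a check, the eigenvalues of a 9 × 9 correlation matrix sum to 9, the total (standardized) variance.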

As observed previously, the total variance for the nine random variables is 9 (since the variance was standardized to 1 in the correlation matrix), which is, as expected, equal to the sum of the nine eigenvalues listed in Figure 5. In fact, in Figure 7 we list the eigenvalues in decreasing order and show the percentage of the total variance accounted for by that eigenvalue.

Figure 7 – Variance accounted for by each eigenvalue

The values in column M are simply the eigenvalues listed in the first row of Figure 5, with cell M41 containing the formula =SUM(M32:M40) and producing the value 9 as expected. Each cell in column N contains the percentage of the variance accounted for by the corresponding eigenvalue. E.g. cell N32 contains the formula =M32/M41, and so we see that 32% of the total variance is accounted for by the largest eigenvalue. Column O simply contains the cumulative weights, and so we see that the first four eigenvalues account for 72.3% of the variance.

Using Excel’s charting capability, we can plot the values in column N of Figure 7 to obtain a graphical representation, called a scree plot.

Figure 8 – Scree Plot

We decide to retain the first four eigenvalues, which explain 72.3% of the variance. In Basic Concepts of Factor Analysis we explain in more detail how to determine how many eigenvalues to retain. The portion of Figure 5 that refers to these eigenvalues is shown in Figure 9. Since all the PC1 values except Expect are negative, we first negate all the values. This is not a problem since the negative of a unit eigenvector is also a unit eigenvector.

Figure 9 – Principal component coefficients (Reduced Model)

Those values that are sufficiently large, i.e. the values that show a high correlation between the principal components and the (standardized) original variables, are highlighted. We use a threshold of ±0.4 for this purpose.

This is done by highlighting the range R32:U40 and selecting Home > Styles|Conditional Formatting > Highlight Cell Rules > Greater Than with the value .4, and then repeating with Highlight Cell Rules > Less Than and the value -.4.

Note that Entertainment, Communications, Charisma and Passion are highly correlated with PC1, Motivation and Caring are highly correlated with PC3 and Expertise is highly correlated with PC4. Also Expectation is highly positively correlated with PC2 while Friendly is negatively correlated with PC2.

Ideally we would like to see that each variable is highly correlated with only one principal component. As we can see from Figure 9, this is the case in our example. Usually this is not the case, however, and we will show what to do about this in Basic Concepts of Factor Analysis when we discuss rotation in Factor Analysis.

In our analysis we retain 4 of the 9 principal components. As noted previously, each of the principal components can be calculated by

yi = $B_i^T X'$

i.e. Y = $B^T X'$, where Y is a k × 1 vector of principal components, B is a k × k matrix (whose columns are the unit eigenvectors) and X′ is a k × 1 vector of the standardized scores for the original variables.

If we retain only m principal components, then Y = $B^T X'$ where Y is an m × 1 vector, B is a k × m matrix (consisting of the m unit eigenvectors corresponding to the m largest eigenvalues) and X′ is the k × 1 vector of standardized scores as before. The interesting thing is that if Y is known, we can calculate estimates of the standardized values via X′ ≈ $BB^T X'$ = $B(B^T X')$ = BY; for the full k × k matrix B this is exact, since B is an orthogonal matrix and so $BB^T = I$, while for the reduced k × m matrix B it gives an approximation. From X′ it is then easy to calculate X.
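A sketch of this reduced-model reconstruction (numpy assumed, illustrative data):

```python
import numpy as np

# Illustrative data: 120 samples of 9 variables.
rng = np.random.default_rng(3)
data = rng.standard_normal((120, 9))

R = np.corrcoef(data, rowvar=False)
lam, B = np.linalg.eigh(R)
B = B[:, np.argsort(lam)[::-1]]        # columns ordered by decreasing eigenvalue

x_std = (data[0] - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# With the full k x k matrix B the reconstruction is exact (B is orthogonal).
assert np.allclose(B @ (B.T @ x_std), x_std)

# Retaining only m = 4 components gives an approximation X' ~ B_m Y.
m = 4
Bm = B[:, :m]
Y = Bm.T @ x_std                       # the m retained principal components
x_est = Bm @ Y                         # estimated standardized scores
```

Multiplying `x_est` by the standard deviations and adding the means (the last step in Figure 10) then recovers estimates of the original scores.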

Figure 10 – Estimate of original scores using reduced model

In Figure 10 we show how this is done using the four principal components that we calculated from the first sample in Figure 6. B (range AN74:AQ82) is the reduced set of coefficients (Figure 9), Y (range AS74:AS77) are the principal components as calculated in Figure 6, X′ are the estimated standardized values for the first sample (range AU74:AU82) using the formula =MMULT(AN74:AQ82,AS74:AS77) and finally X are the estimated scores in the first sample (range AW74:AW82) using the formula =AU74:AU82*TRANSPOSE(B127:J127)+TRANSPOSE(B126:J126).

As you can see, the values for X in Figure 10 are similar, but not exactly the same as, the values for X in Figure 6, demonstrating both the effectiveness and the limitations of the reduced principal component model (at least for this sample data).

### 159 Responses to Principal Component Analysis

1. FRED BULUMA says:

This is just a wonderful explanation that has made me to understand how to use factor analysis. I will be grateful if more can be shared. You are a great statistician.

• Charles says:

Thanks Fred for your kind remarks. I plan to keep adding more to the website and software tools. Charles

2. Thomas says:

very useful indeed – much appreciated. Any chance you could upload the xls file?

3. emmel.wal says:

I cannot thank you enough. I split my hair for the past couple of hours to get a clear, clean, succinct explanation for PCA. This is by far the best step-by-step example I have ever found. It is simply great! what a joy to learn from a great master.

• Charles says:

Thank you for your comment. I am very pleased that you liked the explanation. Charles.

4. Desperate Student says:

Dear Charles

I am trying to use the function CORR in the context of PCA analysis (for my msc thesis) but it is not working properly. I suppose I am doing some kind of silly mistake. I have three rows and 10557 columns (and some missing values). I select 10557*10557 blank cells and insert the corr function and press ctrl + shift + enter. Excel is not able to compute such a large matrix. I tried to filter the data to exclude all columns which include at least one missing value (and to decrease the number of data points) and, in this case, Excel is able to compute the matrix but returns many N/A values. Do you have any idea of what is happening?

Best Regards

• Charles says:

Hi Romulo,
Do the columns correspond to variables and rows to subjects?
I don’t know why you are receiving N/A values. Can you send me the Excel worksheet so that i can see what is going on?
Charles

• Desperate Student says:

Just solved the problem! I wanted a 3*3 matrix, actually. Silly me. I computed the first principal component for each country-period. Would it make sense to compute the second and third principal components for each row and then sum the three principal components weighted by their relative importance in explaining the standardized variance? So if the first eigenvalue is 1.5, the second eigenvalue 0.5 and the third eigenvalue 0.3, I would define the weights as 1.5/(1.5+0.5+0.3) and 0.5/(1.5+0.5+0.3) and 0.3/(1.5+0.5+0.3). I basically want to use the PCs as inputs in a panel regression that yields the maximum R-squared.

Best Regards

• Charles says:

Great. Good to hear that everything worked fine. Charles

5. genkumag says:

Thank you for this. Learned a lot from this post. 🙂

6. Steve says:

Hi Dr. Zaiontz – I really enjoyed walking through this example. One question: which of the nine teacher characteristics are most representative for determining a great teacher?

• Charles says:

Steve,
Based on the sample data, Entertaining is the characteristics with the highest rating, but this is a made up example and so please don’t draw any conclusions from the results presented.
Charles

7. Chris says:

Once you have the principal components how do you make use of them in finding useful correlations? I have seen people use scatterplots for PC1 and PC2 and also plot the original variadles against the PCs.

8. mahesh says:

Dear sir

you explain it in very simplified manner, step by step. great work, sir.
I was wondering if you can explain or just give some link where I can find how to calculate hedging ratio taking all PCs in account.
thanks, its really fun learning this stuff.

• Charles says:

Mahesh,

I have found the following links that could be useful:

Charles

9. Dieter says:

Dear Charles

Thank you for this fantastic resource – your explanations and the excel examples are extremely useful.

I’m looking to run a PCA on a set of data that has a time dimension. I’m using economic data for different countries across several years (e.g. GDP, population, interest rates… from 2000 to 2010 for a list of countries).

I wanted to ask how to account for this in your Excel model? I suppose the analogy to your example would be having the teacher survey across multiple years. Is it possible to find Principal Components for this or will I need to run a PCA for each year separately?

• Charles says:

Dear Dieter,
As usual it all depends on what you are trying to accomplish. You can ignore “year” in the same way as you are probably ignoring “country”, but I am not sure this will accomplish what you want. I am not familar with a time sequenced version of PCA (if even such a thing exists). Sorry but I am afraid I don’t have further insights here.
Charles

10. Curious says:

Hello, Dr. Zaiontz. Let me first express how helpful your Excel tutorials have been to my research. I’m not a statistics major, but I was able to grasp the concept of PCA little by little through your examples.

My question is on how we conclude the PCA. Our objective was to reduce the number of variables from 9 to 4. Looking at Figure 10, there are four principal components (listed under Y). Am I correct in saying that there are now 4 significant variables that explain 72.3%? Among the nine original variables, which of those are the 4 principal components?

Also, you showed us how to compute Y from sample 1. Does using 1 sample suffice or do we need to compute Y for all samples (100+)?

Thanks a lot!

• Charles says:

The four variables explain 72.3% of the variance. These variances are not “significant” in any statistical sense. There is no measure that I know of that says x% is significant and less is not.

The four variables are not among the 9 orginal variables. They are linear combinations of the 9 original variables.

One sample is all you need to perform principal component analysis. Of course the more data you have in this sample the better the results will generally be.

Charles

11. DJ says:

After finding the eigen values/vectors, you identified the high correlation values with a threshold of ±0.4.

What is the guidance for the threshold value to use? I.e. why did you pick 0.4 instead of some other value?

Thanks for your excellent website and information!

• Charles says:

DJ,
Glad you like the website. The 0.4 threshold is somewhat arbitrary, although commonly used. It is supposed to represent a value which shows that the two variables have a sufficient level of correlation. You can use another value, but the goal is to partition the orginal variables based on the factors: variables that are sufficiently close to a factor (preferably one factor), based on the .4 criterion, are in some sense represented by that factor.
Charles

12. R_Analyst says:

Dear Charles,
It is super good article on the subject. I came to know, how critical analysis can also be done on XLS. Here one question made me uncertain on establishing the data matrix, is, what need to be considered as columns? Here, you put ‘Variables’ as columns and ‘Observations/Samples’ as rows. Someother example, put vice-versa. So, Does it make any difference in analysis due to the vice-versa case? How it makes difference, and how to decide the correct pick here?
Pls help 🙂

• Charles says:

Most of the time I use ‘Variables’ as columns and ‘Observations/Samples’ as rows. This is what I have done for PCA.
Charles

• R_Analyst says:

Dear Charles,

many thanks for your note. I am working on an example and find the below loadings(x) for that object x.
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
a1 0.995
a2 -0.902 -0.391 0.171
a3 -0.241 -0.367 0.178 -0.881
r1 -0.150 0.916 0.219 -0.295
m1 -0.320 -0.154 0.875 0.328

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Proportion Var 0.2 0.2 0.2 0.2 0.2
Cumulative Var 0.2 0.4 0.6 0.8 1.0

I am unable how to understand these loading in first block. there all -ve values for Comp1, two are -ve for Comp2. How to interpret these results? How the -ve / +ve orientation of loadings to interpreted? Overall summary, what these Loadings reveal about the sample data I had modelled via PCA.

• R_Analyst says:

Pls find below correct aligned values:
…………..Comp.1.Comp.2.Comp.3.Comp.4.Comp.5
a1…………………………………………………………….0.995
a2……-0.902……..-0.391..0.171…….
a3…………-0.241.-0.367..0.178.-0.881…….
r1…………-0.150..0.916..0.219.-0.295…….
m1…………-0.320.-0.154..0.875..0.328..

……………Comp.1.Comp.2.Comp.3.Comp.4.Comp.5
Proportion.Var….0.2….0.2….0.2….0.2….0.2
Cumulative.Var….0.2….0.4….0.6….0.8….1.0

• Charles says:

I don’t understand the info you supplied. If you send me an Excel spreadsheet with the input data and results, I will try to answer your question.
Charles

• R_Analyst says:

Dear Charles,
I had sent the sample model and inputs & my analysis results to the email noted under ‘contact us’. Pls look into them and help me.

13. Md.Khorshed Hossain says:

Thanks for a post that appears easy to understand to us , though its not so easy.
I have analyzed Risk factor through spss 17 version , 15 variables was considered and using PCA method with help of Anti-image matrix and Rotation Matrix i found 5 components or factor. These Five factor contributed 80.692% of Eigen values and i have nothing problem with result and model fit and interpretation , i want to use a mathmatical equation or model to represent also , would you please inform me about how to write a equation relating variables ? Please give an example ………….

Thanks with Best regards
Khorshed

• Charles says:

I am not sure I completely understand what you mean by an equation which relates the variables. In any case such equations are already described on the referenced webpage (at least based on my interpretation of an equation which relates the variables).
Charles

14. humza says:

evector function does not work in excel 2007.i have installed add-in.how can i find evector function

• Charles says:

This function has worked in the past. What was the error that you found?
Charles

• Bram says:

Dear Charles,

I seem to have a similar problem. I have installed the Add-In in Excel 2010 following the protocol on the website.

When I try to use the eVECTORS formula on a 9×9 correlation matrix, I only get 1 value in return. Not the 9×10 matrix as is described in the example above.

Have you heard about this problem? Is it something I do wrong (use the formula wrong for example)?

I hope you’ll be able to help.

Best regards!

• Charles says:

Dear Bram,
Since eVECTORS is what Excel calls an array formula, you need to highlight a 9 x 10 range enter the formula and press Ctrl-Shft-Enter (i.e. hold down the Control and Shift keys and press the Enter key). If you don’t highlight the proper size range or only press the Enter key you won’t get the correct answer. See Array Functions and Formulas for more information about such functions.
You can also use the Matrix Operations data analysis tool to produce the eigenvalues and eigenvectors in a simpler manner.
Charles

15. daniel says:

Sorry to tell you, but there is a mistake in the notation. How can I calculate a k x k covariance matrix of X, if X has the dimension k x 1?

• Charles says:

Daniel,

At the beginning of the webpage X is a k x 1 vector of random variables. Later on the page I redefine X to be a k x n matrix (based on a random sample of size n for each random variable x_i. The covariance matrix refers to this X. Admittedly there is some abuse of notation here, but otherwise the notation would get pretty complicated looking.

See http://www.real-statistics.com/multivariate-statistics/descriptive-multivariate-statistics/ for how I use the notation.

I will think about how to make this clearer in the future.

Charles

17. Seb says:

Thank you Charles – this has been monumental.


19. Niraj says:

Hi,
A very useful and clear paper. My compliments.
1 In the calculation of Principal components calculation for the y range is given as MMULT(TRANSPOSE(AI61:AQ69),AU61:AU69). Is there an error here? Should it be MMULT(TRANSPOSE(AI61:AI69),AU61:AU69). A 9*9 matrix cannot be multiplied by a 9*1 matrix.
2 When I use the function MMULT as suggested by me the answer is shown in the MMULT dialog box but does not get transferred to the right cell on the excel sheet. I have to read it and key the answer in manually.

One question:
In figure 5 the Eigen values in row M18 to U18 for the PCA. Is it correct to conclude that the Eigen values refer to the 9 attributes as shown below:
Expect 2.88
Entertain 1.43
Comm 1.16
Expert 1.02
Motivate 0.705
Caring 0.647
Charisma 0.56
Passion 0.34
Friendly 0.23
I ask this because somewhere it says the Eigen value table is arranged in descending order – which need not be the same as the order of characteristics of teachers in table 1.
Regards and thanks
Niraj

• Charles says:

Niraj,
I don’t see the problem. If I multiply a 9×9 matrix A by a 9×1 matrix B I get a 9×1 matrix AB.
Charles

20. Specious says:

Thanks to share with us your skills abot PCA.
Is possible to test maximum eigenvalues of covariance matrices in large data sets? If yes,
Which statistical test i can use ?

• Charles says:

Specious,
Sorry, but what are you trying to test the largest eigenvalue for?
Charles

21. Chuck Blumentritt says:

In your description above, you have the formula
{Sigma} = [{Sigma}j=1 to k] of lambda Beta(j)
where in the second sigma is clearly a summation. What is the meaning of the first Sigma? Is it a summation or some variable or what?

• Charles says:

The first sigma is the population covariance matrix.
Charles

22. Carmine says:

Hi Charles,

thanks a lot for this. I just downloaded the Real Statistics Package. Sorry for the obvious question, but I would like to ask how can I obtain a correct 9×9 matrix when using the COV or CORR functions. I mean, I insert the formula (COV or CORR) (B4:J123) in the fist row/first column cell and I get the right figure. How can I expand this to the other cells of the 9×9 matrix and obtain the correct figures?

Carmine

• Charles says:

Carmine,

Suppose you want to place the 9×9 matrix in range L1:T9. Then highlight this range and insert the formula =COV(B4:J123) and press the Ctrl-Shift-Enter keys all together. If you have already placed the =COV(B4:J123) formula in cell L1 then you need to extend the range to L1:T9 and click on the formula bar where =COV(B4:J123) is visible and then press Ctrl-Shift-Enter.

See Array Functions and Formulas for more detail.

Charles

• Carmine says:

Thanks for this Charles. Perhaps, do you know the multivariate technique similar to PCA called ‘vector model for unfolding’?
I am struggling with it. It consists of calculating a vector model in p dimensions, which is equal to minimizing the sum of squared errors ¦¦E¦¦2 for a standardized matrix H(mxn, that is items x respondents) and the low-dimensional representation XA’:
Lvmu=¦¦H-XA’¦¦2

Where X is a mxp matrix of the object scores for the m rows of the first p components and A is a nxp matrix of component loadings. X is standardized to be orthogonal and the component loadings matrix A contains the correlations of the n respondents with p components X.

Do you know if the p components need to be calculated from the the covariance or correlation Matrix derived from H? perhaps, H standardized or not? And what are the object scores and the component loadings in this case? Sorry for the tedious question, I would greatly appreciate some help

Thanks anyway

Carmine

• Charles says:

Carmine,
Sorry, but I am not familiar with vector model for unfolding.
Charles

23. suhani says:

thanks a lot for this,
i want to know what is the different using principal component analysis and principal axis factoring? principal axis factoring is one of the extraction method from factor analysis right? but why some people often compare this two methods (PCA vs PAF)? can you help me
Thank You

• Charles says:

Principal Component Analysis is a type of analysis that is described on the referenced webpage. As you said, it is also a type of extraction method used with Factor Analysis, which causes some confusion, and some people also use the terms Principal Component Analysis and Factor Analysis interchangeably. Principal Axis Factor is another extraction method used with Factor Analysis.
Charles

24. Patrick Van Hove says:

Thank you very much for this very valuable resource. It’s very useful for understanding a little better how these calculations all work.

I’ve been trying to do a PCA, and I installed your resource package and your example file, and anytime I try to use the eVECTORS function when starting from a correlation matrix (or even in your example file, for that matter, so I don’t think it’s an issue with my correlation matrix…) I systematically get a #Value! error…

I’m using excel 2011 for mac, any pointers as to what might be causing this?

• Charles says:

If you send me an Excel worksheet with your data I will try to figure out what is causing the problem.
Charles

25. Brian says:

Thanks for the extremely helpful information and utilities. You have succeeded in explaining things clearly that I haven’t grasped in many stats classes. Question:

When I use “=cov” and “=corr”, rather than generating matrices, they give me just a one cell answer. I am properly referencing the source matrix of variables/observations, but it doesn’t generate a matrix like it does in your explanations on the web site. Should something like “=cov(Nutrition!D4:BB36)” in one cell generate a full matrix?

Thanks again for all your great material.

• Charles says:

Brian,
I am pleased that you find the tools to be useful and the explanations to be clear.
COV and CORR are array functions, and so you need to highlight a range sufficiently large to contain the output and then press Ctrl-Shift-Enter (instead of just Enter). This is explained on the webpage Array Formulas and Functions.
Charles

26. Wilmer Lin says:

I appreciate so much this explanation on PCA. It was the most useful and effective reading i’ve ever made on PCA. Thank you for writing it.

Where’s the scatter plots I’m used to seeing for the different subgroups of data? I want to do a PCA on 6 DNA mutations over the groups chimps, gorilla, orangutan, gibbons, old world monkeys, new world monkeys, and lemurs, with up to 100 members each. I want a plot of PC1 vs PC2 (and maybe PC3) with different symbols or colors for each of the groups above. Do I have to do all this manually after computing the PCs?

• Charles says:

Currently you need to do this manually. Of course, most of the work is done by Excel’s charting capability.
Charles

You should calculate the PC coordinates for each input data point and produce the common scatter plot. This is a lot of extra work for one who may not be a stat nut.

29. Iqbal says:

Thanks alot for the great explaination.

I have some question, How PCA can implementation on Flavor compounds?
I always read some papers about flavor on food. thats paper using PCA to describe th data,

thanks

• Charles says:

Sorry, but you would need to provide further information. I am not an expert of flavor in foods, and so cannot provide help on this topic.
Charles

30. Deepak says:

Hi,
I am getting an error while trying to calculate eigen vectors. The formula evectors returns an error “Compile Error in Hidden Module : Matrix”.

Deepak

• Charles says:

Deepak,

Quite a few people are having the same problem lately. In order to help you and the others with this problem, please answer the following questions:

1. What value to you get when you enter =VER() in any cell in a blank Excel spreadsheet?
2. Are you able to use any other data analysis tool? E.g. the Descriptive Statistics and Normality data analysis tool?
3. Which version of Windows are you using?
4. Does the version of Office 2007 that you are using also contain Access and/or Outlook?

Charles

• Charles says:

Deepak,
I have recently heard from others who are having problems specifically with Excel 2007 as part of Office 2007 Professional. I have also been given the suggestion that upgrading to the latest Office 2007 service pack (namely SP3) can resolve this type of problem, but I have not been able to test this myself.
Charles

• Deepak says:

Hi Charles,
The Excel version is 2007 – 4.2 and is part of MSO Professional on Win8. Except for eValues, which returned only a single value and not the output described here, none of the formulas that I have tried so far work; they all pop up the same error.
I will try it out on some other version and let you know what happens.

Deepak

31. aliaksei says:

Hi, Charles.
Could you please explain the case when some of the variables are highly correlated with not one principal component but, say, with two or three? How do the calculations change in this case (I am talking about the very end of this article, where a threshold was chosen)? The article “Basic Concepts of Factor Analysis” that you refer to for such cases did not help me, since it does not contain a numerical example.
Thank you.

• Charles says:

The goal is to have most variables correlate with one principal component (not two or three). Unfortunately this doesn’t always happen. I can think of only a few solutions: (1) choose a different rotation, (2) eliminate the “offending” variable or (3) live with a less than ideal result.

Perhaps someone else in the community has another idea.

Charles

32. Niket Bhargava says:

warm greetings…
with due respect…
120 students to rate the importance of each of the following 9 criteria.

Where and how can I find this data to perform this exercise in Excel?

• Charles says:

Charles

33. Amelia Marsh says:

Dear Mr Charles,

I am trying to understand the Principal Component Analysis and your tutorial is really good and very very helpful. I need your guidance regarding –

(1) Can PCA be applied over a text?

The reason behind (1) is –

Assuming I have analysts reports regarding say 250 companies. I am aware that out of these 25 companies, 5 companies have defaulted. I have been asked to apply principal component analysis to each of these 25 companies to find out those words which if are occurring in say the 26th companies Analyst report, it will give me clear indication that this company will default. I do understand this is a vague question, but this is an assignment given to me in my office.

(2) Is it possible for you to share the data sheet about the students mentioned in Example 1, so that I also can try to actually calculate the values to understand PCA in a better way.

Regards and sincerely sorry for bothering you.

Regards

Amelia Marsh

• Charles says:

Hi Amelia,

(1) This is an interesting question, but PCA doesn’t seem to be the correct tool since it requires continuous data, which is ordered. Your data is not ordered. Correspondence Analysis seems like it might be a better fit for the problem. I will be adding this capability to the website and software shortly.

(2) You can download the worksheet for all the examples on the website by going to the webpage

Charles

34. anandan says:

Hi,
Can anyone help me determine the weights of criteria in a multi-criteria decision-making problem using principal component analysis?

35. Boro Dega says:

I always see that when plotting the PC’s against each other it is always PC1 against PC2, PC2 against PC3 and so forth…my question is what will be the incentive of plotting PC1 against PC2, PC1 against PC3 so forth..why would I do that and if I do it would that be wrong?

• Charles says:

The only reason I can think of for doing that is to see more clearly whether there is a great difference between PCn and PC1. Although it wouldn’t be “wrong” to do this, I prefer to look at the usual scree plot to find the inflection point.
Charles

36. Rich says:

Happy New Year, Charles

Do you have a posting or discussion that addresses Principal Components Regression in general, and more specifically, using your suite of tools?

Thanks, Rich

• Charles says:

Rich,
Happy New Year to you too.
PCA is usually viewed as a special case of factor analysis. This is explained in detail throughout the Factor Analysis webpages. See Factor Analysis for links to all these pages.
The Real Statistics tools are described on the webpage Real Statistics Support for Factor Analysis.
Charles

37. Sam Preston says:

Hi Charles,

I commented yesterday on another page having used the Varimax function to give me 10 out of 11 original variables highly correlated with one principal component.

I went on to calculate new X values as shown in Figure 10. I apologise if this is a stupid question (I am new to statistics), but what is the next step? Should the principal components (Y column) be recalculated using the new set of X values?

Thanks again,

Sam

38. Ahmed says:

Very nice explanation.
Dr. Charles, could you kindly attach all the data for the examples you explain?
I need all the data so I can try to solve the examples with other statistical software such as SPSS.

• Charles says:

Ahmed,
You can download all the data from the examples by going to
Examples Workbooks
Charles

39. Sebastián García says:

How can I determine the weight of a variable using Factor Analysis?

• Charles says:

Charles

40. Thanh Thao says:

Dear Charles,

I don’t understand why eigenvector 1 changes sign between Figure 5 and Figure 9:

in Figure 5: 0.108673, -0.41156, -0.44432, …
in Figure 9: -0.108673, 0.411555, 0.44432, …

• Charles says:

If X is a unit eigenvector corresponding to eigenvalue c, then so is -X. I simply changed the sign of all the elements in that eigenvector to keep as many positive values as possible. You should get equivalent values even if you don’t do this.
Charles
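The sign point above is easy to check numerically: if v is a unit eigenvector of Σ for eigenvalue λ, then so is -v, so two correct PCA runs can legitimately differ in sign. A small sketch with a made-up covariance matrix:

```python
import numpy as np

sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.5],
                  [0.6, 0.5, 1.0]])       # a made-up covariance matrix
evals, evecs = np.linalg.eigh(sigma)
v, lam = evecs[:, -1], evals[-1]          # leading unit eigenpair

assert np.allclose(sigma @ v, lam * v)    # v is an eigenvector for lam
assert np.allclose(sigma @ -v, lam * -v)  # ...and so is -v
```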

41. MGL says:

Hi Charles,

I’m going to reference the Excel Spreadsheet File(s) that are provided as examples in this post. I have a few questions.

Real-Statistics-Multivariate-Examples.xls
Tab: PCA

I understand all of the math that is going on after slogging through where the numbers come from. However, in the Excel Sheet that I referenced above, I noticed something strange:

This has to do with the Reduced Model cells, which have to do with Matrix B (AN74:AQ82), Vector Y (AS74:AS77), Vector X′ (AU74:AU82) and Vector X (AW74:AW82).

The Question in particular has to do with the calculation for Vector X, which the formula in Excel is:

AU74:AU82*TRANSPOSE(B127:J127)+TRANSPOSE(B126:J126)

But as I parcel it out, there are some strange things going on.

If I parcel out TRANSPOSE(B127:J127)+TRANSPOSE(B126:J126), that’s basically an addition of two vectors, the Mean and the Standard Deviation for the nine variables, for which I get values of:

4.974544736
8.736692957
5.099293202
4.397380333
7.664479602
7.377230536
7.431224993
5.677330181
7.888403337

So I put these values into a separate set of cells, let’s just say:

AY74:AY82

But when I do the math of:
AU74:AU82*AY74:AY82

I get:

-4.898239264
-0.660121779
-6.197503264
2.422801034
-2.007222372
-2.098727253
-6.777166555
-1.530919504
11.13149207

These numbers are a lot different than the calculation for the X Vector which is:

2.461544584
8.123596964
1.445506515
3.942037365
6.457931704
5.618899676
2.480028027
4.468199567
8.130308341

So my question is: what is going on here? This may seem like a stupid question, but I am curious as to the calculation that is going on behind the scenes, as it were.

• MGL says:

Oh, nevermind, order of operations. Silly me.
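For readers who hit the same snag: the worksheet formula multiplies each standardized value by its standard deviation first and then adds the mean (X = X′·sd + mean); it is not X′·(sd + mean), which is what adding the two vectors first computes. A sketch with illustrative values (not taken from the worksheet):

```python
import numpy as np

x_std = np.array([-0.5, 1.2, 0.3])   # standardized (reduced-model) values, made up
mean  = np.array([4.0, 6.0, 5.0])    # per-variable means, made up
sd    = np.array([1.0, 2.7, 0.1])    # per-variable standard deviations, made up

correct = x_std * sd + mean          # un-standardize: multiply by sd, then add mean
wrong   = x_std * (sd + mean)        # adding the two vectors first gives different values
assert not np.allclose(correct, wrong)
```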

42. Muhamamd ( Mu Mu) says:

I am working on oil supply risk assessment. I have 5 indicators and a further 11 sub-indicators. I want to assign weights by PCA. Can anyone help make clear to me how to use PCA, with an example? I’ll be thankful to you. Please send me at mosikhann@yahoo.com Thanks

• Charles says:

The referenced webpage describes how to do this in general. You haven’t supplied enough information for me to be able to guide you further.
Charles

43. Jesper Rindboel says:

Dear Charles,
Thank you very much for your tools, they have made my (work)life much easier.

What would be the significance of calculating the eigenvalues of the covariance matrix instead of standardizing the numbers twice (before and after) to work with the correlation matrix?

Regards,
Jesper

• Charles says:

Jesper,
I don’t know what happens if you use the alternative approach that you are describing.
Charles

44. Juan Perez says:

Hi Charles, thank you for the explanation.

However, I’m wondering if you could publish all data of the example to try to reproduce your analysis, and then I will feel confidence to do my own. Thank you again!

• Charles says:

Juan,
This info is already available. See
Examples Workbooks
Charles

• alba~math says:

I guess you are talking about the file Real-Statistics-Multivariate-Examples.xlsx. Could you please tell me which sheet is used in the example of this blog post?

• Charles says:

Yes, it is in that file. You will find it in the Principal Component Analysis sheet (in the Factor Analysis group of sheets).
Charles

45. steve says:

Your example is fantastic, thank you.

How would these values be used in a regression model? Assuming I want to find a forecast for n and have 9 variables above.

• Charles says:

Steve,
Sorry, but I don’t understand your question.
Charles

• steve says:

Thanks for the prompt reply 🙂

We have produced a model that has reduced our input variables. Can we use this to estimate a given variable not included? Or are we just evaluating the relationships between our input variables (i.e. we can’t predict values from our output)?

Thanks again

• steve says:

I guess I am asking if I can use the output from the PCA model in a regression model

• Charles says:

Steve,
You can use the factor scores, as described in the other response I am providing to you.
Charles

• steve says:

That’s brilliant – I have worked through those examples

Once I have the factor scores the next step is to regress them against the dependent variable, is that correct?

• Charles says:

Yes
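The regress-on-factor-scores step confirmed above can be sketched as ordinary least squares of the dependent variable on the retained scores. All data below is made up, and keeping 4 components is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 9))                # predictors (made up)
y = rng.normal(size=120)                     # dependent variable (made up)

# Standardize, take eigenvectors of the correlation matrix, keep 4 components
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
evals, evecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
scores = Z @ evecs[:, np.argsort(evals)[::-1][:4]]

# Regress y on the scores (with an intercept column)
A = np.column_stack([np.ones(len(y)), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # intercept + 4 coefficients
```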

46. Michael says:

Greetings,

Your example is very helpful. I am curious how this might be applied to the development of indexes for an industry when the most precise data might be time-series NAIC (or SIC) codes from the Census Bureau since this significantly reduces the number of observations.

Best wishes,
Michael

• Charles says:

Michael,
Perhaps because I am not familiar with the NAIC and SIC codes, I don’t exactly understand your question. Can you please provide some background?
Charles

• Michael says:

Hi Charles,

NAIC’s are the North American Industry Classification system, while SIC’s are the Standard Industry Classification system. These are coding systems used to classify industries such as debt collection, telecommunications, law firms, etc. and correspond with a business’s tax filing code. I hope this helps.

Best wishes,
Michael

• Charles says:

Michael,
Thanks for explaining this, but I still don’t understand what it is you are looking for from me. I don’t have time to investigate NAIC and SIC to try to figure out how to develop indexes for an industry.
Charles

• Michael says:

Charles,

Perhaps my reference to NAIC and SIC codes as a frame of reference confused the question. I was only referencing these coding systems since they show aggregate time-series data for numerous industries.

Your above example appears to show a snapshot in time, while I would like to know how PCA would be applied to the development of an index with time-series data.

Once again, sorry for the confusion. I hope this better clarifies my question.

Best wishes,
Michael

• Charles says:

Sorry Michael, but I just don’t have anything to add to what I said earlier.
Charles

• Michael says:

Charles,

Okay, I didn’t think asking about the application of time-series data to PCA was an unreasonable question. Thank you for trying to help.

Best,
Michael

47. David says:

Hi Charles,

Spotted this little typo:
=MMULT(TRANSPOSE(B4:J123-B126:J126),B4:J123-B126;J126)/(COUNT(B4:B123)-1)

Should be
=MMULT(TRANSPOSE(B4:J123-B126:J126),B4:J123-B126:J126)/(COUNT(B4:B123)-1)

Regards

• Charles says:

David,
Thanks for catching this hard-to-spot typo. I had to stare at the formula a couple of times before I found the error. I really appreciate your help in making the website better.
Charles
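The corrected formula computes the sample covariance matrix S = (X − x̄)ᵀ(X − x̄)/(n − 1). A numpy equivalent of the same computation, with made-up data standing in for the range B4:J123, checked against np.cov:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 9))            # 120 rows, 9 variables (made up)
D = X - X.mean(axis=0)                   # subtract each column's mean
S = D.T @ D / (X.shape[0] - 1)           # the MMULT(TRANSPOSE(...),...)/(n-1) step

assert np.allclose(S, np.cov(X, rowvar=False))
```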

48. Prashant Dhamale says:

Dear Charles,
Your example is very good, but it has one very basic mistake.
The example you have considered uses ordinal data, and Pearson’s correlation coefficient is for cardinal data.
So I guess you have to use Spearman’s rank correlation to find the correlation matrix.
Prashant

• Charles says:

Prashant,
It is common to treat Likert data as interval data. Generally, the larger the range of values, the more reasonable this assumption is. With a range from 1 to 10, I don’t see too much problem.
Charles

49. MSIS says:

Hi Charles,
Thanks for a clear explanation.
Let me see if my understanding is correct:

1) The covariance matrix E is symmetric, so that it can be diagonalized as

E = A D A^T

where D is the diagonal matrix whose diagonal entries d_ii are the eigenvalues, and the ith column of A is the eigenvector associated with eigenvalue d_ii

2) We can change the basis representing E so that, in this new basis, Cov(x_i, x_j) = 0
and Var(x_i) = d_ii

3) This is where I am a bit confused:
The components in PCA are then just the eigenvectors in this new basis satisfying the properties in 2.

Am I close?
Thanks.

• Charles says:

Although I am tempted to wade into the details of your interesting question, I am afraid that I really don’t have the time at present to do this. I am busy trying to complete the writing of a book that is long overdue.
Charles
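For what it's worth, points 1) and 2) of the question above can be checked numerically: a symmetric covariance matrix E factors as A D Aᵀ, and in the eigenvector basis the covariance becomes the diagonal matrix D. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))               # made-up data, 4 variables
E = np.cov(X, rowvar=False)                 # symmetric covariance matrix
d, A = np.linalg.eigh(E)                    # eigenvalues d, eigenvector columns in A

assert np.allclose(E, A @ np.diag(d) @ A.T)              # E = A D A^T
Y = X @ A                                                # change of basis
assert np.allclose(np.cov(Y, rowvar=False), np.diag(d))  # Cov(y_i, y_j) = 0, Var(y_i) = d_ii
```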

50. Ray Rui says:

Hi Charles,
Thank you very much for sharing. I am just studying PCA. I downloaded your Excel file and read your text. It is very helpful. My question is how I can weight each original variable based on PCA. In your Excel file, I did not see a PCA-based weight for each original variable, e.g. a weight for the Expectation variable or the Communication variable. Could you update the Excel file or suggest how to get a weight for each original variable from your existing example? Thanks.

Ray Rui

• Charles says:

Ray,
I think you are looking for the factor scores. Please see the following webpage:
Factor Scores
Charles

51. Sana says:

Just wanted to know: is it possible somehow to get back the correlation matrix from the eigenvalue and eigenvector matrices? Further, can the correlation matrix be transformed back to the original matrix?

• Charles says:

Sana,

A correlation matrix cannot be transformed back to the original matrix. This fails even for a 1 x 1 correlation matrix, i.e. the correlation between any two variables. Suppose I have two samples and compute the correlation between them. Now suppose I take any linear transformation of one or both of these samples (e.g. I double all the elements in the first sample and add one to all the elements in the second sample). The correlation between these new samples will be the same as the correlation between the original samples.

Regarding your other question, not all eigenvalues can be obtained from some correlation matrix. So we need to ask the related question. If I have a collection of eigenvalues and eigenvectors for some correlation matrix, is it possible that these are also the eigenvalues and eigenvectors for some other correlation matrix? I don’t immediately know the answer to this question.

Charles
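The invariance point above is easy to illustrate numerically. A sketch with made-up samples (using a positive scaling and a shift, which preserve the correlation exactly):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)                      # made-up sample
y = rng.normal(size=50)                      # made-up sample

r_before = np.corrcoef(x, y)[0, 1]
r_after  = np.corrcoef(2 * x, y + 1)[0, 1]   # double x, add one to y
assert np.isclose(r_before, r_after)         # correlation is unchanged
```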

52. Parth says:

Dear Charles
I am not able to understand how we can calculate the eigenvalues and eigenvectors from the covariance matrix. Could you please explain with an example?
Thank you
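A small worked example of the calculation asked about above: the eigenpairs of a covariance matrix Σ solve Σv = λv. For a 2×2 matrix they can be found by hand from det(Σ − λI) = 0; numerically, a routine like numpy's `eigh` does the same job. (This is a generic illustration, not the Real Statistics eVECTORS implementation.)

```python
import numpy as np

sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])   # a made-up 2x2 covariance matrix
# By hand: det(sigma - lam*I) = (2 - lam)^2 - 1 = 0, so lam = 1 or 3,
# with eigenvectors proportional to (1, -1) and (1, 1) respectively.
evals, evecs = np.linalg.eigh(sigma)      # ascending order: [1, 3]

assert np.allclose(evals, [1.0, 3.0])
assert np.allclose(sigma @ evecs[:, 1], 3.0 * evecs[:, 1])
```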

53. Raed Ameen says:

Kindly, Dr. Charles: I have a questionnaire with three main dimensions and 44 structured items. My question is, can I use PCA three times, individually with each dimension? Because the results will not be correlated if I use PCA for all the items. And is that acceptable statistically? Could you please point me to some papers as evidence for this case.
Regards,

Raed Ameen

• Charles says:

Raed,
If there is limited correlation between items in the three dimensions, then three separate PCAs seem reasonable. Sorry, but I don’t have any references.
Charles

54. Faith Scott says:

Good day, Sir Charles.

Thank you for your clear and concise explanation. Currently, I am having trouble with my thesis since it consists of correlated variables. I have a total of 11 variables. My concern now is after PCA or after I choose the variables that should be included in the model, how would I run multiple regression and see the effect of predictors? Thank you so much, sir!

55. Mark Perew says:

Dr. Zaiontz –

I’m doing a replication study of a complex model with multiple inputs. The survey questions map to different inputs for the model formula. For example, Questions 1, 2, and 3 map to input X of the model and Questions 4, 5, and 6 map to input Y. Would I perform separate covariant analyses on each group of questions to obtain the coefficient for the model, as opposed to comparing all of the questions?

Thank you

• Charles says:

Mark,
Sorry, but I don’t understand your question.
Charles

Dear Charles,
Many thanks for your page. You really simplified the problem and it became very easy to understand.
Now I need to know how to get the coordinates of the variables to make a plot.
Also how to do the same thing with the rows.

A

• Charles says:

I don’t understand what sort of plot you are making and what you mean by a row.
Charles

Hi Dr
I mean I need to know how to get the variable coordinates for any plane (for example F1×F2)
Thanks

57. Diana Molina says:

I’m wondering if with Real Statistics you can get the “typical” PCA graph, the one that has the two main principal components and all the vectors. I have watched some videos and read your post, but I don’t find that graph.

Thanks

• Charles says:

Diana,
Sorry, but no such graph is currently included.
Charles

58. Luis E says:

Dear Charles:
when I was learning Lotus 1-2-3 several (actually a lot of!) years ago, my professor said that by 2010 scientists would not need anything except spreadsheet software to perform their daily statistics tasks. He missed the mark, but thanks to professionals like you, we are getting there. Your work is outstanding. For people like me, interested more in the practical side of statistics than in the mathematical theory behind it, but still enjoying “crunching the numbers” ourselves, your Excel product is simply pure bliss, so easy to understand and use. Thank you so much, sincerely.

And a question: I am working with a matrix of observations from a nominal scale (1 to 5). I feel that PCA could give me some good results, but my variables are more than my samples. Is that a scenario allowed for PCA?

• Charles says:

Luis,
I also used Lotus 123 a hundred years ago (and Visicalc before that). I am glad that you like Real Statistics. I have tried to make the website understandable for people with a good mathematics background and those without. It is good to hear that at least sometimes I have succeeded.
I have two questions for you, which should help me better understand your question to me:
1. Are the scale elements 1 to 5, ordered (Likert scale)?
2. What do you mean when you say that your “variables are more than [your] samples”?
Charles

• Luis E says:

Hi Charles, thanks you very much for your prompt response.
You are right on the money: we are using a Likert scale where 1 is total disagreement and 5 total agreement.
This is for a team project on organizational behavior. I call each of the people surveyed a “sample” and each characteristic evaluated (perception of group morale, quality of communication, etc.) a “variable”. It is very similar to your teacher evaluation example, but you have 120 students and 9 characteristics while I have 25 people and 50 characteristics.
Thanks again for your kind attention.

• Charles says:

Luis,
It looks pretty similar. The real difference is that the sample size is much smaller (120 vs. 25). It seems like you should be able to use PCA. I suggest that you try and see what happens.
Charles

• Luis E says:

Thank you so much Charles

59. Robert says:

eVectors function returns only 1 value instead of the expected table of values.

I’ve tried highlighting a table before adding values but same result. Tried evect function as well, same result. Help is appreciated

• Robert says:

Disregard previous, I have since solved that issue.

I am seeing a table of eigenvalues the same size as my matrix, whereas I understand it should be one row larger than my matrix. My matrix is 11×11, so the return from the eVECTORS() function should be 12×11?

• Robert says:

Disregard, please delete this comment chain. I must be sleepy after lunch! I did not select an area large enough to display the full table. My apologies, I greatly appreciate your RealStatistics package and this writeup as well. It is thorough, understandable, and IMMENSELY helpful

• Charles says:

No problem, Robert. Glad that the site and software have been helpful to you.
Charles

60. Balnagendra says:

Dear Charles,
I have done the PCA calculation inch-by-inch on the teachers data with a mix of R, Excel and now an RDBMS.
Here are the final results by sample sno.
I have two questions:
1. What comes after this? Which sample is the best fit?
Like in the heptathlon example, who becomes the winner?
How do I calculate the final winner?
What if the final results do not match the PCA scores?
What I understood from PCA analysis is that it tells me which attributes are important according to the student samples.
Hence the name Principal Component Analysis.
Can it tell me which student did the best analysis?

2. Though a trivial question: why do I have to reduce the 9 dimensions to only 4?
With an RDBMS in the picture, it takes no extra effort to calculate all 9 dimensions.

thanks and regards

Balnagendra

“sno” “pc1” “pc2” “pc3” “pc4” “pc5” “pc6” “pc7” “pc8” “pc9”
1 0.782502087334704 -1.96757592201719 0.23405809749101 -1.12370069530359 0.765679125793536 0.661425865567924 -0.222809638610116 -0.149636015110716 -0.566940520416496
2 -0.974039659053665 2.04359104443955 -1.23102878804303 0.897707252817376 0.62491758484155 -1.09293623842783 -0.25896093055637 -0.225691994152001 -0.0398918478123148
3 2.10935389975489 -1.13846368970928 -1.07593823321308 0.283099057955826 -0.454549023294147 0.48844080714382 -0.894717995156593 -1.13899026199429 0.332691886333957
4 -0.724542053938968 0.691249601217778 -1.30865737642341 1.10848710931945 0.421806648458918 0.54955379892904 0.360672871353102 0.878709245413913 -0.414999556544989
5 -2.05965764874651 0.67930546605803 -1.67250852628847 -0.442481799437531 0.441101216317619 -0.273201679728252 -0.500097018678376 0.271317803148488 0.225190072763112
6 2.43697948851031 0.503973196537053 -0.464668276745191 -0.248369826536429 0.152057372044889 -0.0799040720635874 -0.629526574819221 0.607031366208402 0.885317491232681
7 1.17245795137631 0.373731432198285 1.10867164120596 1.5678378018626 0.627519469278004 0.188683503372758 -0.07050766739135 0.132528422971828 0.0412494362260818
8 0.929875093449278 0.311040551625064 0.145002287998668 0.283938851724668 0.564514738830247 0.642120596407302 0.319321868315749 -0.199037953705316 -0.0323163030469737
9 1.66910346463562 -1.2212052055784 -1.28613633226678 0.871926188450568 0.70404050847328 0.265578633840202 0.221453746999601 -0.454372056267191 0.399659792858934
10 -0.198902559902836 -0.529886141564662 -0.615238857850917 -1.19210853004315 -0.410788410814714 1.51598714991609 0.300040704880264 0.575755240080053 0.15679992171981
11 -3.44923191481845 -1.29740802339576 -0.055992772070436 0.2457182327445 -1.65991556858923 -0.535506231103958 0.658015264886284 -0.95044986973395 0.000566072310586335
12 -2.81692063293946 -3.40384190887908 -0.893243510415138 -2.21141879957508 0.434597001132725 0.519758768406618 0.85773115672659 0.101264311365968 -0.158952025440208
13 3.59171215449185 1.89176724118918 0.309251533624552 0.148957624701782 1.04009291354419 0.495619063804899 -0.257887667064842 -0.470623791443016 -0.0260280774736903
14 2.43512039662592 0.120346557491314 -0.896549265542583 -0.910496053134617 -0.23413309260305 0.652773344323182 -1.68141586952921 -0.209797697616484 0.912309673333395
15 -1.53971851243715 -0.216742717298959 2.22568232786775 1.03142181778516 1.38951593065816 -0.471413970592574 -0.830745712084571 -1.61220862040483 0.222454783522403
16 2.24223480845739 2.35826977756215 1.2747368099665 -3.96720539683461 -0.466867078838757 0.121235298989979 -0.0232835231112048 -0.305793148606438 0.54092086078009
17 0.643588469992117 -0.802846033161245 -1.15972977997649 1.24077586872133 0.109661349223429 -0.968391519947697 -0.685678339025484 0.0119856104795104 0.0191784905652393
18 0.0605691510216728 0.440091501248074 -1.60061610404203 1.0351395426926 -0.586476218998342 -0.172542804522174 0.177496305442361 0.645297211995821 0.342240723425264
19 0.89355982634065 -2.91635609294914 2.24844424549618 -0.00602433157646132 -0.0462814393706587 -0.015883213471414 -1.12544685259491 1.10394020113906 -0.668139093324923
20 -2.57466224283692 0.958694578468226 -2.33723748181028 -0.282876078427233 0.212422390598862 1.23134839354597 -0.831918364183796 0.24837866691648 0.32331843221832
21 -0.484481289090662 -0.501745627013796 2.75613339654198 -1.44825549659657 0.0156583659430092 -1.16699814608865 1.70208049153513 -0.531149553790845 0.405445852502525
22 1.38777080666039 1.08048291166824 1.64908264825942 0.891736159284439 -0.00803677873848277 1.13661534154646 -0.247899574392479 -0.314678844686956 0.663010472845527
23 -1.30290228195755 0.342299709437455 -0.371719818041046 0.902592218013332 -0.644777635093048 -0.0279742544685612 0.463999031647619 -0.62658910190333 -0.555196726833023
24 0.213824122273244 -0.412684757562806 1.51531627755052 -0.583294864783411 -0.269411518958366 0.50295876036215 -0.690962661566869 0.479054580680195 -0.691970884957572
25 -0.155822279166174 -1.60976027334598 -0.711986808199222 -1.86536634672466 -0.883832552993952 -0.722560894639709 -1.24530535523036 -0.144664760165016 -0.115222738956481
26 -2.69877396564969 1.62963912731594 -0.514195752540654 -1.00411350043428 0.596577181041759 -0.0107220470446568 -0.642391522634704 0.237356164605707 0.121333338885615
27 0.146685187402298 -0.590693721046644 -0.304710177633694 0.405116975656278 1.48346586150245 -0.293097908513852 -0.283789887164248 -0.311081345626959 -0.120469023615086
28 -0.0186311770114081 -1.58572901206572 -0.503729654635307 -1.73606715154119 -0.988131869641573 0.133910914389329 -1.77753570166105 -0.261101208206812 -1.41584721982239
29 0.677288584895166 0.786255250303149 -0.837182955382588 -1.0384636819257 -0.594273812941059 -0.254082306924117 0.879639187633599 -0.163839757279767 0.811081335444439
30 0.257846401217108 -1.72184638340196 0.743177759412136 -0.835847135276094 0.56458354944428 -0.627480039376463 -0.692905401867567 0.133789578881429 -0.386793467886672
31 1.7980302783598 2.12058438154593 -1.47851476432776 2.17827064659105 -1.40164741209632 -0.196539645614289 0.526985594001873 0.115189945764083 -0.486649352781431
32 0.18785139181874 -0.95232113692873 -1.75829870416512 -0.238883929583808 0.787133265197832 -1.35733823181046 0.200841407511822 -0.0747419411673103 -0.657602221913508
33 2.07360763220394 1.06731079583539 -0.622374706345469 -0.407774107153335 0.0130842816396723 -0.475625652632808 0.0294714493567123 -1.79423920739281 0.229505409374093
34 -1.98702109952536 -0.781487999282249 1.8164494545129 1.34819381569373 -0.268261921997533 -0.0853394830349178 -0.617236656725877 0.106704070809775 -0.0672452426170835
35 -0.0842635785978546 -0.808546183457541 -3.08423962863902 -0.252280507032949 0.0275689047741305 0.0489928109792082 -0.00971143118565888 -0.174008227810538 -0.0363928558907684
36 1.1442444875801 -0.959543118874319 -0.118633559145614 0.258040522415929 0.989913450382754 -1.03561838547745 -0.266218359491416 -0.571758226928065 -0.0348342932688773
37 -0.168033597135758 -0.511057583518805 -1.41543111329488 1.32435271728601 0.568813272392035 -0.264159738394365 -0.563673659304422 0.207847685989173 0.204471100566793
38 -0.551533469623016 1.57020288197434 -0.0793959332825079 -0.168970931523401 -0.650107894486544 0.95831821409573 -0.618640959971633 0.030499789561141 -0.0439695162519376
39 -0.229378133545641 -2.30075545748959 -1.78155728331527 0.728597155915061 -0.0463930812216655 -0.156789720387239 -0.708489990330352 0.882542324550004 0.227296000174256
40 1.42985410475402 1.33176962888974 1.30276567831546 0.489155934453709 -1.62521812097704 -1.7684116429541 -0.254782304816342 0.363334625151276 -0.478314985300798
41 0.0883215545494497 0.406514274747451 -1.25471616362036 -0.0883729103999533 0.988423869629932 1.79165671185606 1.04948052056105 0.0388046618972064 0.0224461567151149
42 0.231451682540001 0.281247058707726 -0.450737778437919 -0.120979422058898 1.28590367125086 0.363684029402355 1.10214945127868 0.191373409712103 -0.617536883435618
43 -3.92216605893476 -0.268177898752344 0.667320473744558 0.272570937639709 0.369786539175288 0.369012364164966 -0.131476924442594 0.346695440849866 0.261396747586279
44 -2.87021236245319 -1.71832832417395 -1.42544714499528 -0.838536702017114 -0.0638544794969115 -1.04657609700902 0.757024945744024 0.0865551076163885 -0.244666174796757
45 1.33981963793272 -0.647189848428991 -0.409137730520909 -1.67577140876126 -1.24989812262005 0.730597403428801 1.01936766561098 0.435967295797924 0.0367180873288577
46 1.08154318733386 -0.411869334053834 0.171818513055551 -0.310876338380063 -1.58017697507288 0.332096061550745 0.192501807136242 0.950465745229624 -0.450344528221605
[rows 47 through 120 of the commenter's pasted 9-column data set omitted]

• Charles says:

Balnagendra,

1) I am not familiar with either the example you have provided or the Heptathlon example, and so I don’t have the proper context to answer your questions, although it is likely that these are not the type of questions that PCA is designed to answer.

As stated on the referenced webpage, Principal component analysis is a statistical technique that is used to analyze the interrelationships among a large number of variables and to explain these variables in terms of a smaller number of variables, called principal components, with a minimum loss of information.

While PCA does identify which components are most important, these components are not the same as the original variables, but are instead linear combinations of those variables.

2) While it is true that it doesn’t take more work to include all 9 dimensions, the main purpose of PCA is to reduce the number of dimensions, hopefully identifying hidden variables (the principal components) that capture the concepts in the more numerous original variables.

Charles
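
The dimension reduction Charles describes can be illustrated with a short sketch (my own, using numpy on synthetic data, not the workbook's dataset): the principal components are eigenvectors of the covariance matrix, and keeping only the top few yields new variables that are linear combinations of all nine originals.

```python
import numpy as np

# Illustrative sketch on synthetic data: reduce 9 variables to the
# 2 principal components that capture the most variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))          # 100 observations, 9 variables

S = np.cov(X, rowvar=False)            # 9 x 9 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]      # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                  # keep the 2 largest components
Xc = X - X.mean(axis=0)                # center the data
Y = Xc @ eigvecs[:, :k]                # each column of Y is a linear
                                       # combination of the 9 originals
print(Y.shape)                         # (100, 2)
```

Each retained component is a single "hidden variable" built from all of the original ones, which is why it can summarize them with minimal loss of information.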

61. brian c says:

In Defn 1, shouldn’t Y = beta*X and not beta^T*X? Also, where does beta^T*X = [beta(i,j)] come from? None of this seems clear.

• Charles says:

Brian,
Since beta is a k x k matrix, it really doesn’t matter whether we view Y = beta x X or Y = beta^T x X, as long as we are consistent.
For any row i, beta_i is a k x 1 column vector whose entries are the beta_ij values for that i, and so beta_i^T is the corresponding 1 x k row of beta^T.
I agree that the notation is a bit confusing. It is probably easier to follow things using the real example that is described on the webpage.
Charles
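
Charles's point can be checked numerically. In this sketch (my own, using numpy on synthetic data), the columns of beta are the unit eigenvectors of the covariance matrix, and applying the transform yields variables that are uncorrelated with variances equal to the eigenvalues, exactly as Property 1 states.

```python
import numpy as np

# Numeric check: with beta's columns the unit eigenvectors of Sigma,
# the transformed variables satisfy var(y_i) = lambda_i and
# cov(y_i, y_j) = 0 for i != j.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated data

Sigma = np.cov(X, rowvar=False)          # 4 x 4 covariance matrix
lam, beta = np.linalg.eigh(Sigma)        # columns of beta are eigenvectors

Y = (X - X.mean(axis=0)) @ beta          # y_i = beta_i^T x for each row
Sigma_Y = np.cov(Y, rowvar=False)        # equals beta^T Sigma beta

# Off-diagonal covariances vanish; the diagonal holds the eigenvalues
print(np.allclose(Sigma_Y, np.diag(lam)))  # True
```

Whether one writes Y = beta X or Y = beta^T X only changes whether the eigenvectors sit in the rows or the columns of beta; the diagonal covariance matrix for Y comes out the same either way.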

62. Omachi says:

Hi, I get a single-cell answer when I use the eVectors function on an 8×8 correlation matrix. Why is this?

• Charles says:

Omachi,
eVectors is an array function, and so you can’t simply press Enter when you use this function; you need to highlight the full output range and press Ctrl-Shift-Enter. See the following:
Array Functions and Formulas
Charles

63. Fred says:

Hello Charles,
I hope you are well! When you say:

“Observation: Our objective is to choose values for the regression coefficients βij so as to maximize var(yi) subject to the constraint that cov(yi, yj) = 0 for all i ≠ j. We find such coefficients βij using the Spectral Decomposition Theorem (Theorem 1 of Linear Algebra Background). Since the covariance matrix is symmetric, by Theorem 1 of Symmetric Matrices, it follows that

Σ = β D βT

where β is a k × k matrix whose columns are unit eigenvectors β1, …, βk corresponding to the eigenvalues λ1, …, λk of Σ and D is the k × k diagonal matrix whose main diagonal consists of λ1, …, λk. Alternatively, the spectral theorem can be …”

Do you mean Σy? Are we decomposing X’s covariance matrix or Y’s?

Sorry, I think I got confused about that part.

Fred

• Fred says:

Oooooops, sorry, it actually makes sense now…very nice… the betas cancel out and still the variance is maximised and the covariance minimised. This is actually one of those moments where maths are beautiful 🙂
Is there a geometric interpretation of the process? It is almost like using an eigenvector basis that captures more variance than the standard basis…
Once again, superb explanation!!!
Thanks Charles
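
The geometric picture Fred asks about can be made concrete with a small sketch (my own, using numpy on synthetic data): PCA rotates the data into the eigenvector basis, and the first axis of that basis carries more variance than any single axis of the standard basis.

```python
import numpy as np

# Geometric interpretation: rotate an elongated data cloud into the
# eigenvector basis; the first new axis captures the maximum variance.
rng = np.random.default_rng(2)
A = np.array([[3.0, 1.0], [1.0, 0.5]])
X = rng.normal(size=(1000, 2)) @ A       # correlated, elongated cloud

Sigma = np.cov(X, rowvar=False)
lam, vecs = np.linalg.eigh(Sigma)        # ascending eigenvalues
Y = (X - X.mean(axis=0)) @ vecs[:, ::-1] # rotate; largest axis first

# The first rotated axis holds at least as much variance as any
# original coordinate axis
print(np.var(Y[:, 0], ddof=1) >= Sigma.diagonal().max())  # True
```

This is exactly the "eigenvector basis that captures more variance than the standard basis": the rotation is rigid (orthonormal), so total variance is preserved while as much of it as possible is packed into the leading axes.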

64. Bea says:

Thanks for this! The breakdown through Excel helped me understand PCA a lot better.

I’ve followed the accompanying workbook and I think there might be an error in the multivariate workbook in the PCA tab, AS61 to AS69 as it just picks up the first row in the raw data. Am I right in saying so?

Cheers.

• Charles says:

Bea,
I am pleased that this webpage helped you understand PCA better.
The range AS61:AS69 is only intended to show the first row in the raw data, as explained in Figure 6 of the webpage.
Charles