In our Movie Reviews example, the movies are displayed along one or two axes. What does it mean to move along one of them, say from left to right? To answer this question, we have to dive a bit deeper into how PCA works.
Let’s plot Height versus Weight for 200 people:
Height-Weight Distribution
Intuitively, there’s a link between these two variables: the taller, the heavier, and vice versa. Put differently, height and weight are correlated: part of the information each of them gives is the same. By maximising the dispersion, PCA effectively replaces correlated variables with a single one, which is a linear combination of them. This new variable becomes the new axis along which we show our data.
We can repeat the process with the information remaining, which gives a second axis, and so on. These axes are the principal components.
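The process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the computation behind the charts in this doc: the 200 people are synthetic, and the means, spreads and noise level are assumptions chosen only to make height and weight correlated.

```python
import numpy as np

# Synthetic data (an assumption, NOT the dataset plotted above): 200 made-up people.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)                           # cm
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 8, size=200)  # kg, tied to height

X = np.column_stack([height, weight])
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each variable

# The principal components are the eigenvectors of the correlation matrix,
# sorted by decreasing eigenvalue (the dispersion captured along each axis).
corr = np.cov(Z, rowvar=False, ddof=0)
eigvals, eigvecs = np.linalg.eigh(corr)    # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()        # share of the total dispersion per axis
print("share of dispersion per axis:", explained.round(2))
```

The first axis should carry most of the dispersion, and the second axis what remains, which is the "repeat the process with the information remaining" step.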
First Component
In our example, the first principal component (PC1) is made of the variables Height and Weight:
PC1 = A x Height + B x Weight
A and B are the factors used to define the first principal component. These factors, also called loadings, indicate its composition, but also how much the initial variables were correlated:
Loadings for PC1
Factor | Variable Name | Principal Component 1
A | Height | 0.71
B | Weight | 0.71
Both Height and Weight contribute strongly, and with the same sign, to the definition of PC1, reflecting the positive correlation between them.
Note: the factors have the same value because the variables have been rescaled to a “standard” form (same mean and same spread) before running PCA.
An increase in Height and/or Weight produces an increase in the first principal component. So PC1 represents the general size of a person. On a plot of the dataset using PC1 as the horizontal axis, the samples on the left correspond to smaller and/or lighter persons, and those on the right to taller and/or heavier persons.
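The note about equal factors can be checked numerically. The sketch below assumes that the “standard” form means both variables have mean 0 and standard deviation 1, so that PCA works on their correlation matrix; `loadings_2d` is a hypothetical helper written for this illustration, and the correlation values 0.3 and 0.9 are arbitrary.

```python
import numpy as np

def loadings_2d(r):
    """Loadings and variances of PC1 and PC2 for two standardized variables
    (Height, Weight) whose correlation is r, with r > 0."""
    corr = np.array([[1.0, r], [r, 1.0]])
    eigvals, eigvecs = np.linalg.eigh(corr)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # largest variance first
    vecs = eigvecs[:, order]
    # The sign of an eigenvector is arbitrary: flip each column so that
    # the Weight loading is positive, as in the values quoted in this doc.
    vecs = vecs * np.sign(vecs[1, :])
    return vecs, eigvals[order]

for r in (0.3, 0.9):
    vecs, variances = loadings_2d(r)
    print(f"r={r}: PC1={vecs[:, 0].round(3)} PC2={vecs[:, 1].round(3)} "
          f"variances={variances.round(2)}")
```

In this two-variable standardized case the loadings come out at about ±0.71 for both values of r. What changes with the strength of the correlation is the variance carried by each component (1 + r for PC1 and 1 - r for PC2).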
Second Component
With the remaining information, the second principal component (PC2) is calculated as:
PC2 = C x Height + D x Weight
PCA gives us C=-0.707 and D=0.707. Loadings can also be represented graphically:
Loadings for PC2
As the loadings have the same magnitude, Height and Weight contribute equally to the definition of PC2. But C and D have opposite signs, so PC2 captures the cases where one variable increases while the other decreases: it goes up when weight increases and height decreases, and goes down in the opposite case.
So PC2 represents the exceptions to the “the taller, the heavier” rule captured by PC1. This second principal component shows the dispersion around the weight expected for a given height, such as cases of over- or underweight. On a plot of the dataset using PC2 as the vertical axis, the samples at the bottom correspond to persons lighter than expected for their height, and those at the top to persons heavier than expected for their height.
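To make this reading of the two axes concrete, here is a small sketch that applies the loadings quoted above (A = B = 0.707, C = -0.707, D = 0.707) to four hypothetical persons. Their standardized heights and weights (in standard deviations from the mean) are invented for the illustration.

```python
import numpy as np

# Loadings quoted in the text: columns are PC1 = (A, B) and PC2 = (C, D).
W = np.array([[0.707, -0.707],   # Height
              [0.707,  0.707]])  # Weight

# Hypothetical standardized (height, weight) values, not taken from the dataset.
people = {
    "tall and heavy":  ( 1.5,  1.5),
    "short and light": (-1.5, -1.5),
    "tall but light":  ( 1.0, -1.0),
    "short but heavy": (-1.0,  1.0),
}

# Score on each component = loading-weighted sum of the standardized variables.
scores = {name: np.array(z) @ W for name, z in people.items()}
for name, (pc1, pc2) in scores.items():
    print(f"{name:16s} PC1={pc1:+.2f}  PC2={pc2:+.2f}")
```

Persons who follow the “the taller, the heavier” rule move along PC1 only (right if large, left if small), while persons who break the rule move along PC2: down if lighter than expected, up if heavier than expected.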
Height-Weight along the first two Principal Components
Summary
PCA provides a clearer picture of our dataset by changing the axes used to represent it. Each axis is a smart combination of our dataset’s variables, so that their correlation is baked in. This comes with a price tag: the new axes no longer have the meaning of the original variables, and have to be interpreted based on the factors used in the combination. To do so, we have to look carefully at the loadings of the principal components of our dataset.
🏋️ Next we’ll flex our PCA muscles with a more advanced example, to help you decide
But before that, here’s what’s needed to interpret the principal components of our Movie Reviews example — the factors (or loadings), represented as charts. The interpretation is left to you :)
Movie Reviews: Loadings for PC1
Movie Reviews: Loadings for PC2