Different kinds of inter-correlated data can be collected into a random vector. For example, we could consider the total length, weight, diameter of the nose holes, tail length, number of teeth, eye colour, radius of the acetabula, and number of vertebrae of dinosaurs as data and collect them into a (random) vector. **Principal component analysis (PCA)** is a method for reducing the dimension of a random vector (data), and also for discovering outstanding features of the data. Using PCA, one can replace a random vector with another vector of smaller dimension without losing much information. In PCA, the measure of information loss is the **total variation** of the original variables collected in the random vector.

See this cheat sheet for the mathematical identities used in this post.

## Least-squares optimality of PCA

Assume random variables $X_1,\dots ,X_r$ collected in a random r-vector $X$, i.e. $X=(X_1,\dots ,X_r)^\text T$. It has a mean $\mu_X$ and a covariance matrix $\Sigma_{XX}$. A **linear projection** of the random variables means:

\xi=BX\qquad B\in\mathbb{R}^{t\times r},\ t\le r

Each row of $B$ linearly combines the random variables to create a new random variable.

Let’s project the r-vector $X$ onto a lower dimensional space and obtain the t-vector $\xi=BX$ with $t<r$. We want to find a linear map to reconstruct $X$ out of $\xi$, which has a lower dimension. This means finding a vector $\mu\in\mathbb{R}^{r}$ and a matrix $A\in\mathbb{R}^{r\times t}$ such that $X\approx \mu+A\xi$. The approximation is in the least-squares sense of minimizing the mean squared error. We consider the following as the measure of how well the linear map can reconstruct $X$ out of its linear projection $\xi$:

\text E\big [(X-\mu - A\xi)^\text T(X-\mu - A\xi)\big ]\ \in \mathbb{R}^{+}

Therefore, we should find $\mu$, $A$, and $B$ by minimizing the following mean:

\text E\big [(X-\mu - ABX)^\text T(X-\mu - ABX)\big ]\ \in \mathbb{R}^+\tag{1}

This expression is the mean of the squared norm/magnitude of the error vector, which is itself a random vector.

With the assumption that both $A$ and $B$ are of full rank *t*, the minimizing parameters will be (see the proof here):

\mu=\mu_X,\qquad A=(v_1,\dots ,v_t),\qquad B=A^\text T\tag{2}

where $v_i$ is a normalized eigenvector of the covariance matrix $\Sigma_{XX}$, associated with the *i*-th largest eigenvalue, $\lambda_i$, of $\Sigma_{XX}$. The eigenvectors are orthonormal to each other, i.e. $v_i^\text Tv_j=\delta_{ij}$ ($\delta_{ij}$ is the Kronecker delta). Note that $BA=A^\text TA=I_t$, where $I_t$ is the $t\times t$ identity matrix.

From the structures of the matrices $A$ and $B$, it can be observed that they have rank *t*. Therefore, the best rank-*t* approximation of the original random vector $X$ is $\tilde X$:

\tilde X=\mu_X+AB(X-\mu_X)\tag{3a}

or:

\tilde X= \mu_X+\sum_{i=1}^{t}(v_iv_i^\text T)(X_c)\qquad \text{s.t. }X_c=X-\mu_X\tag{3b}

**Remark:** $AB=\sum_{i=1}^{t}v_iv_i^\text T$, and $BA=I_t$.

The covariance matrix $\Sigma_{XX}$ is real and symmetric, therefore it has a spectral decomposition $\Sigma_{XX}=V\varLambda V^\text T$ s.t. $V$ is an orthogonal matrix (a matrix with orthonormal columns/rows) whose columns are normalized eigenvectors of $\Sigma_{XX}$, and $\varLambda$ is a diagonal matrix holding the eigenvalues, $\lambda_i$, of $\Sigma_{XX}$. The eigenvectors in $V$ are sorted such that the *j*-th column of $V$ is associated with the *j*-th largest eigenvalue. Having said that, we observe that the best rank-*t* approximation of the original random vector only uses the first *t* eigenvectors of the random vector’s covariance matrix.
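As a concrete sketch, the spectral decomposition and the rank-*t* reconstruction of Eq. 3 can be written with NumPy. The data below is synthetic and the dimensions are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, t = 500, 5, 2

# Hypothetical data: n observations (rows) of an r-dimensional random
# vector with correlated components.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, r))

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)   # covariance estimate (1/n)

# Spectral decomposition Sigma = V Lambda V^T, eigenvalues sorted decreasing.
eigvals, V = np.linalg.eigh(Sigma)           # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Best rank-t approximation (Eq. 3): X_tilde = mu + V_t V_t^T (X - mu)
Vt = V[:, :t]
X_tilde = mu + (X - mu) @ Vt @ Vt.T

# The minimum mean squared reconstruction error equals the sum of the
# trailing eigenvalues (Eq. 4).
mse = ((X - X_tilde) ** 2).sum(axis=1).mean()
print(np.isclose(mse, eigvals[t:].sum()))    # True
```

Note that `eigh` is used (rather than a generic eigensolver) because the covariance matrix is symmetric.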

The minimum of Eq. 1 is:

\sum_{i=t+1}^{r} \lambda_i \tag{4}

**Principal components (PC’s) of X:** the first *t* principal components (also known as the *Karhunen-Loeve* transformation) of $X$ are defined by the following **linear projections**:

\xi^{(t)}=BX=A^\text TX=(v_1^\text TX,\dots ,v_t^\text TX)^\text T\tag{5}

The superscript $(t)$ is used to indicate the dimension of the PC vector, which equals the number of PC’s.

Note that the PC’s are themselves random variables, as they are linear combinations of the original random variables.

**PC’s of X are uncorrelated**: the covariance between each pair of the principal components is:

\text{Cov}(\xi_i,\xi_j)=v_i^\text T\Sigma_{XX}v_j=\delta_{ij}\lambda_j\tag{6}

Interestingly, $\text{Cov}(\xi_i,\xi_j)=0$ for $i\ne j$. This means that the principal components of $X$ are uncorrelated. Moreover, $\text{Var}(\xi_i)=\lambda_i$.

**A goodness-of-fit measure:** Eq. 5 maps the *r*-dimensional data ($X$) into the lower *t*-dimensional data $\xi^{(t)}$. The total variance of the original data, i.e. of $X_1,\dots ,X_r$ collected in $X$, is $\sum_{i=1}^{r}\lambda_i=\text{tr}(\Sigma_{XX})$. Based on this, a measure of how well the first *t* principal components (in the lower *t*-D space) represent the *r* original data variables is (also note Eq. 4 and 6):

\frac{\sum_{i=t+1}^{r}\lambda_i}{\sum_{i=1}^{r}\lambda_i}\tag{7}

This measure is small if the first *t* principal components contain a large part of the total variation in $X$; the smaller the measure, the better the approximation.
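A quick numerical sketch of this ratio (the eigenvalues below are made up for illustration):

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, in decreasing order.
eigvals = np.array([4.0, 2.5, 1.0, 0.4, 0.1])

def unexplained_ratio(eigvals: np.ndarray, t: int) -> float:
    """Eq. 7: fraction of the total variation NOT captured by the first t PCs."""
    return eigvals[t:].sum() / eigvals.sum()

# ≈ 0.1875: the first 2 PCs keep about 81% of the total variation.
print(unexplained_ratio(eigvals, 2))
```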

**Remark:** if $t=r$, Eq. 7 becomes zero, which means a perfect match, i.e. no approximation error. In other words, $t=r$ means $A=V$ (as in the SVD of $\Sigma_{XX}$); then, because $V$ is orthogonal, $AB=VV^\text T=I$, hence (Eq. 3), $\tilde X=X$.

## PCA as a variance-maximization method

The main aim of PCA is to reduce the dimension of data. The way a dinosaur would do such a reduction on a data vector is to keep one component and eat all the rest. It is a perfectly nonsensical reduction: many pieces of valuable information are lost. Another way is to take a simple average, $\frac{1}{r}\sum_{i=1}^{r}X_i$. This approach is still undesirable because it treats all components as equally important (equal weights). Another approach is a weighted average with some constraint on the weights, i.e. $\sum_{i=1}^{r}w_iX_i$ such that $\sum_{i=1}^{r}w_i^2=1$. In matrix notation and using the Euclidean vector norm:

\sum_{i=1}^{r}w_iX_i=w^\text TX \quad \text{ s.t. }\quad \sum_{i=1}^{r}w_i^2=w^\text T w=\|w\|^2=1\qquad w_i\in \mathbb{R}

This constrained weighted average is called a **standardised linear combination (SLC)**. However, the problem is how to uniquely and meaningfully determine the weights. PCA chooses the weights in order to **maximize the variances** of the **SLC’s** of the variables $X_i$, or, equivalently, to **maximize the variances** of the **linear projections of the random vector**, i.e. $w^\text TX$. An SLC is in fact the projection of the vector $X$ onto the direction described by the weight vector $w$. In mathematical notation, we want to solve the following maximization problem:

\max_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\tag{8a}

In other words,

v_1=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\tag{8b}

**Principal components of X** are **defined** based on this maximization. The **1st PC of X**, denoted by $\xi_1=v_1^\text TX$, is the **SLC** (of the variables) with the **highest variance**. This is obtained by the maximization problem expressed by Eq. 8. $\xi_1$ then has the highest variance among all other SLC’s of the variables/projections of $X$:

\text{Var}(\xi_1)=\text{Var}(v_1^\text TX)\ge\text{Var}(w^\text TX)\qquad \forall w:\|w\|=1

$v_1$ is called the **1st principal direction (PD)** or **principal axis**.
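This maximizing property can be sanity-checked with a small Monte-Carlo experiment; the covariance matrix below is synthetic and the random unit weight vectors stand in for arbitrary SLC’s:

```python
import numpy as np

rng = np.random.default_rng(3)
r = 4

# A synthetic covariance matrix (symmetric positive semi-definite).
M = rng.standard_normal((r, r))
Sigma = M @ M.T

eigvals, V = np.linalg.eigh(Sigma)
v1 = V[:, -1]                         # eigenvector of the largest eigenvalue

def slc_variance(w, Sigma):
    """Variance of the SLC w^T X, i.e. w^T Sigma w."""
    return w @ Sigma @ w

# No random unit vector should beat the first principal direction.
best = max(
    slc_variance(w / np.linalg.norm(w), Sigma)
    for w in rng.standard_normal((1000, r))
)
print(best <= slc_variance(v1, Sigma) + 1e-12)   # True
```

Here `slc_variance(v1, Sigma)` equals the largest eigenvalue of the covariance matrix, which is the maximum of Eq. 8.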

The **2nd PC of X**, denoted by $\xi_2=v_2^\text TX$, is the **SLC** (of the variables) with the **highest variance** such that the maximizing weight vector $w$ is also subject to the orthogonality constraint $w^\text Tv_1=0$. In mathematical notation, the 2nd PC is obtained by the following maximization:

v_2=\argmax_{\{w:\|w\|=1,\ w^\text Tv_1=0\}}\text{Var}(w^\text TX)

By the same token, the ***i*-th PC** and ***i*-th PD** or **principal axis** are defined by the following:

v_i=\argmax_{\{w:\|w\|=1,\ w^\text Tv_j=0,\ j<i\}}\text{Var}(w^\text TX),\qquad \xi_i=v_i^\text TX

It can be proved (see the references) that the maximizing weight vectors are the normalized eigenvectors of $\Sigma_{XX}$ associated with its eigenvalues sorted in decreasing order, i.e.:

\Sigma_{XX}=V\varLambda V^\text T\ \text{ s.t. } i\lt j \implies \lambda_{i}\ge \lambda_{j}\\ \ \\ \text{i.e. }\lambda_1\ge \lambda_2\ge \dots \ge\lambda_r \ge 0

Therefore:

\text{Cov}(\xi_i,\xi_j)=\text {Cov}(v_i^\text TX,v_j^\text TX)=v_i^\text T\Sigma_{XX} v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \ \\ \implies \text{Var}(\xi_1)\ge \text{Var}(\xi_2)\ge\dots\ge\text{Var}(\xi_r)\ge0

which means the PC’s not only have the maximum variances among all possible SLC’s of the variables (all possible projections of the random vector), but are also mutually uncorrelated.
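Both properties, diagonal covariance and decreasing variances, can be verified empirically; a minimal sketch on synthetic data (the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 2000, 4

# Synthetic correlated data: an arbitrary mixing of independent variables.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, r))

Sigma = np.cov(X, rowvar=False, bias=True)
eigvals, V = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# All r principal components at once, computed on centred data.
Xi = (X - X.mean(axis=0)) @ V

# Their covariance matrix is diagonal, with the (decreasing) eigenvalues
# on the diagonal: the PCs are uncorrelated.
C = np.cov(Xi, rowvar=False, bias=True)
print(np.allclose(C, np.diag(eigvals)))   # True
```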

In a matrix form, we can write:

\xi=V^\text TX\qquad \xi,X\in\mathbb{R}^{r\times 1},\ V^\text T\in\mathbb{R}^{r\times r}\tag{12}

where $X$ and $\xi$ are the random vector and its linear projection vector containing the PC’s of $X$, respectively.

It is readily concluded that:

\text{Var}(\xi)=\text{Var}(V^\text TX)=V^\text T\Sigma_{XX}V=\varLambda

## PCA in practice: Sample PCA

In practice, the PC’s are to be estimated out of *n* **independent** observations of the random r-vector of data. This requires estimating the mean and covariance matrix of the random vector, using the sample mean and sample covariance.

Here, we use the notation $\bold X_i$ for the *i*-th independent trial to observe the random vector $X$.

**Note:** we denote an observation vector with a capital letter to indicate that it is still a random vector (out of the same population and through independent trials). To indicate a particular realization/value of $\bold X_i$ we shall use a small letter, $\bold x_i$ (by convention).

We collect the sequence of *n* independent observations of the random vector in a set as:

\{\bold X_1,\bold X_2,\dots ,\bold X_n\}\tag{13}

The estimate for $\mu_X$ is given by:

\hat \mu_X=\bold{\bar X}:=\frac{1}{n}\sum_{i=1}^{n}\bold X_i\tag{14}

Let $\bold X_{ci}:=\bold X_i-\bold{\bar X}$ for the *n* observations of the random vector, and let $\bold X_c:=(\bold X_{c1},\dots ,\bold X_{cn})\in\mathbb{R}^{r\times n}$. Then, the estimate for the covariance matrix is:

\hat\Sigma_{XX}=S:=\frac{1}{n}\bold X_c\bold X_c^\text T=\frac{1}{n}\sum_{i=1}^{n}\bold X_{ci}\bold X_{ci}^\text T\tag{15}

The ordered eigenvalues and their associated eigenvectors of $S$ then estimate the population parameters. This leads to writing:

\hat\lambda_1\ge\hat\lambda_2\ge\dots\ge\hat\lambda_r\ge 0\\ \ \\ \hat A=(\hat v_1,\dots,\hat v_t)=\hat B^\text T\tag{16}

In practice, the vectors of independent observations are collected as columns (or rows) of a matrix (just re-writing Eq. 13). Let’s call this matrix the **matrix of observations**. Here, each column represents an observation of $X$:

\bold X=(\bold X_1,\dots ,\bold X_n)\in\mathbb{R}^{r\times n}\tag{17}

Recalling Eq. 5 or Eq. 12, PC’s can be calculated for each column of the matrix of observations. In this regard, we obtain **estimates** of the PC’s of $X$:

\hat\xi_{ij}=\hat v_j^\text T\bold X_i

This is the estimate of the ***j*-th** PC of $X$ calculated based on the ***i*-th** observation $\bold X_i$. The term $\hat\xi_{ij}$ is called the **score** for the ***i*-th** observation on the ***j*-th** PC.

For each observation, we can write the above in a more compact form:

\hat\xi_{i}^{(t)}=\hat A^\text T\bold X_i\ \in\mathbb{R}^{t\times 1}\\ i=1,\dots,n\tag{18}

### References
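Putting the sample PCA together, the whole pipeline of Eq. 14–18 fits in a few lines of NumPy. The observation matrix below is synthetic, and, following Eq. 18, the raw observations are projected (in practice the data is often centred first):

```python
import numpy as np

rng = np.random.default_rng(2)
r, t, n = 3, 2, 100

# Matrix of observations (Eq. 17): each COLUMN is one observation of the r-vector.
Xobs = np.array([[3.0], [1.0], [0.3]]) * rng.standard_normal((r, n))

# Sample mean (Eq. 14) and sample covariance (Eq. 15).
xbar = Xobs.mean(axis=1, keepdims=True)
Xc = Xobs - xbar
S = (Xc @ Xc.T) / n

# Estimated eigenpairs sorted by decreasing eigenvalue (Eq. 16).
lam, Vhat = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, Vhat = lam[order], Vhat[:, order]
A_hat = Vhat[:, :t]

# Scores (Eq. 18): the estimated first t PCs for every observation.
scores = A_hat.T @ Xobs        # column i holds the score vector of observation i
print(scores.shape)            # (2, 100)
```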

– Modern Multivariate Statistical Techniques, A. J. Izenman.

– Applied Multivariate Statistical Analysis, W. Hardle, L. Simar.

Although mostly paraphrased, this post might contain copyrighted sentences of the references.