PCA: Introduction

Different kinds of inter-correlated data can be collected into a random vector. For example, we could consider the total length, weight, diameter of the nose holes, tail length, number of teeth, colour of the eyes, radius of the acetabula, and number of vertebrae of dinosaurs as data and collect them into a (random) vector. Principal component analysis (PCA) is a method for reducing the dimension of a random vector (data), and also for discovering outstanding features of the data. Using PCA, one can replace a random vector with another vector of smaller dimension without losing much information. In PCA, the measure of information loss is the total variation of the original variables collected in the random vector.

See this cheat sheet for the mathematical identities used in this post.

Least-squares optimality of PCA

Assume random variables X_i collected in a random r-vector X=[X_1,\dots,X_r]^\text T, i.e. X\in \mathbb{R}^{r\times 1}, with mean \mu_X \in  \mathbb{R}^{r\times 1} and covariance matrix \Sigma_{XX} \in  \mathbb{R}^{r\times r}. A linear projection of the random variables is:

\xi=BX\text{ s.t. } \xi \in \mathbb{R}^{t\times 1}\text{ and }B \in \mathbb{R}^{t\times r} \\ t\le r

Each row of B linearly combines the random variables to create a new random variable.
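As a quick illustration, here is a minimal NumPy sketch of such a projection; the projection matrix B and the realization of X are made up purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

r, t = 4, 2                      # original and reduced dimensions (t <= r)
X = rng.normal(size=(r, 1))      # one made-up realization of the random r-vector X
B = rng.normal(size=(t, r))      # an arbitrary projection matrix B

xi = B @ X                       # each row of B combines the X_i into one new variable
print(xi.shape)                  # (2, 1)
```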

Let’s project the r-vector X onto a lower-dimensional space and obtain the t-vector \xi with t\lt r. We then want to find a linear map that reconstructs X out of the lower-dimensional \xi. This means finding a vector \mu \in  \mathbb{R}^{r\times 1} and a matrix A \in \mathbb{R}^{r\times t} such that  X\approx \mu + A\xi. The approximation is in the least-squares sense of minimizing the mean squared error. We consider the following as the measure of how well the linear map reconstructs X out of its linear projection \xi:

\text E\big [(X-\mu - A\xi)^T(X-\mu - A\xi)\big ]\ \in \mathbb{R}^{+}

Therefore, we should find \mu, A, and B by minimizing the following mean:

\text E\big [(X-\mu - ABX)^T(X-\mu - ABX)\big ]\ \in \mathbb{R}^+\tag{1}

This expression is the mean of the squared norm/magnitude of the error vector, which is itself a random vector.

With the assumption that both A and B are of full rank t, the minimizing parameters will be (see the proof here):

A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X \tag{2}

where v_i \in \mathbb{R}^{r \times 1} is a normalized eigenvector of the covariance matrix \Sigma_{XX} and it is associated with the i-th largest eigenvalue, \lambda_i, of \Sigma_{XX}. The eigenvectors are orthonormal to each other, i.e. v_i^\text{T}v_j=\delta_{ij} (\delta_{ij} is the Kronecker delta). I_r is the identity matrix.

From the structures of the matrices A and B, it can be observed that they have rank t. Therefore, the best rank-t approximation of the original random vector X is \tilde X:

X\approx \tilde X= \mu_X+C(X-\mu_X)\qquad \text{s.t. }C=AB=AA^\text T \in \mathbb{R}^{r\times r}\tag{3a}

or:

\tilde X= \mu_X+\sum_{i=1}^{t}(v_iv_i^\text T)X_c\qquad \text{s.t. }X_c=X-\mu_X\tag{3b}

Remark: \text E\big[X\big]=\text E\big[\tilde X\big]=\mu_X, and \text{Var}(\tilde X)=C\Sigma_{XX}C^\text T= C\text{Var}(X) C^\text T \in \mathbb{R}^{r\times r}.

The covariance matrix \Sigma_{XX} \in \mathbb{R}^{r\times r} is real and symmetric, therefore it has a spectral decomposition \Sigma_{XX}=V \varLambda V^\text T, where V \in \mathbb{R}^{r\times r} is an orthogonal matrix (a matrix with orthonormal columns/rows) whose columns are the normalized eigenvectors of \Sigma_{XX}, and \varLambda is a diagonal matrix holding the eigenvalues \lambda_i of \Sigma_{XX}. The eigenvectors in V are sorted such that the j-th column of V is associated with the j-th largest eigenvalue. With this, we observe that the best rank-t approximation of the original random vector only uses the first t eigenvectors of the random vector’s covariance matrix.
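To make this concrete, here is a minimal NumPy sketch of Eqs. 2 and 3a, assuming a small illustrative covariance matrix and mean (all numbers are made up for demonstration):

```python
import numpy as np

# Illustrative (made-up) covariance matrix and mean of X
Sigma_XX = np.array([[4.0, 2.0, 0.5],
                     [2.0, 3.0, 0.3],
                     [0.5, 0.3, 1.0]])
mu_X = np.array([1.0, -2.0, 0.5])

# Spectral decomposition: eigh returns eigenvalues in ascending order,
# so flip to get lambda_1 >= lambda_2 >= ... and the matching eigenvectors
eigvals, eigvecs = np.linalg.eigh(Sigma_XX)
lam, V = eigvals[::-1], eigvecs[:, ::-1]

t = 2
A = V[:, :t]                     # Eq. 2: A = [v_1 ... v_t], B = A^T
C = A @ A.T                      # Eq. 3a: C = A B = A A^T

# Best rank-t approximation of a (made-up) realization x, Eq. 3a
x = np.array([2.0, -1.0, 1.5])
x_tilde = mu_X + C @ (x - mu_X)
print(x_tilde)
```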

The minimum of Eq. 1 is:

\sum_{i=t+1}^{r} \lambda_i \tag{4}
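As a sanity check of Eq. 4, one can sample from a multivariate normal distribution with the illustrative parameters above and compare a Monte Carlo estimate of the mean squared reconstruction error with \sum_{i=t+1}^{r}\lambda_i (a sketch, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative (made-up) parameters as above
Sigma_XX = np.array([[4.0, 2.0, 0.5],
                     [2.0, 3.0, 0.3],
                     [0.5, 0.3, 1.0]])
mu_X = np.array([1.0, -2.0, 0.5])

eigvals, eigvecs = np.linalg.eigh(Sigma_XX)
lam, V = eigvals[::-1], eigvecs[:, ::-1]

t = 2
C = V[:, :t] @ V[:, :t].T        # C = A A^T as in Eq. 3a

# Draw many realizations of X and estimate E[(X - X_tilde)^T (X - X_tilde)]
X = rng.multivariate_normal(mu_X, Sigma_XX, size=100_000)    # rows are realizations
X_tilde = mu_X + (X - mu_X) @ C.T                            # Eq. 3a applied row-wise
mse = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(mse, lam[t:].sum())        # the two numbers should be close (Eq. 4)
```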

Principal components (PC’s) of X: the first t principal components (also known as the Karhunen-Loève transformation) of X are defined by the following linear projections:

\xi^{(t)}=A^{\text T}X\ \in \mathbb{R}^{t\times 1} \implies\xi_i^{(t)}=v_i^{\text T}X\qquad i=1,\dots ,t\le r\tag{5}

The superscript (t) is used to indicate the dimension of the PC vector, which equals the number of PC’s.

Note that PC’s are themselves random variables, as they are linear combinations of the data random variables.

PC’s of X are uncorrelated: the covariance between each pair of the principal components is:

\text {Cov}(\xi_i,\xi_j)=\text {Cov}(v_i^{\text T}X,v_j^{\text T}X)= v_i^{\text T}\Sigma_{XX}v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \tag{6}

Interestingly, \text {Cov}(\xi_i,\xi_j)=0\ \text{ for } i\neq j. This means that the principal components of X are uncorrelated. Moreover, \text {Cov}(\xi_i,\xi_i)= \text {Var}(\xi_i)=\lambda_i, hence \text {Var}(\xi_i)\ge  \text {Var}(\xi_j)\ \text{ for }\ i\lt j.
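Numerically, Eq. 6 can be checked on the illustrative covariance matrix used earlier: V^\text T\Sigma_{XX}V should be diagonal, with the eigenvalues in decreasing order on its diagonal (a minimal sketch):

```python
import numpy as np

# Illustrative (made-up) covariance matrix of X
Sigma_XX = np.array([[4.0, 2.0, 0.5],
                     [2.0, 3.0, 0.3],
                     [0.5, 0.3, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma_XX)
lam, V = eigvals[::-1], eigvecs[:, ::-1]     # sorted in decreasing order

# Covariance matrix of the PCs: V^T Sigma_XX V = diag(lambda_1, ..., lambda_r)
cov_PC = V.T @ Sigma_XX @ V
print(np.round(cov_PC, 10))                  # off-diagonals vanish (Eq. 6)
print(lam)                                   # the diagonal, in decreasing order
```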

A goodness-of-fit measure: Eq. 5 maps the r-dimensional data X into the lower t-dimensional \xi. The total variance of the original data, i.e. the X_i collected in X, is \sum_{i=1}^{r}\text{Var}(X_i)=\text{Trace}(\Sigma_{XX})= \sum_{i=1}^{r}\lambda_i. Based on this, a measure of how well the first t principal components (in the lower t-D space) represent the r original data variables is (also note Eq. 4 and 6):

\frac{\sum_{i=t+1}^{r} \lambda_i}{\sum_{i=1}^{r} \lambda_i}=\frac{\text{total variance of }X_i\text{\textquoteright s}-\text{total variance of }\xi_i\text{\textquoteright s}}{\text{total variance of }X_i\text{\textquoteright s}}\tag{7}

The measure is small if the first t principal components contain a large part of the total variation in X; hence, a better approximation.

Remark: if t=r, Eq. 7 becomes zero, which means a perfect match, i.e. no approximation. In other words, t=r means A=V (V as in the spectral decomposition \Sigma_{XX}= V \varLambda V^\text T); then, because V is orthogonal, C=VV^\text T=I_r, hence (Eq. 3a)  \tilde X= \mu_X+I_r(X-\mu_X)=X.
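Here is a short sketch of Eq. 7 on the same illustrative covariance matrix, sweeping t from 1 to r (at t=r the measure is exactly zero, as the remark states):

```python
import numpy as np

# Illustrative (made-up) covariance matrix of X
Sigma_XX = np.array([[4.0, 2.0, 0.5],
                     [2.0, 3.0, 0.3],
                     [0.5, 0.3, 1.0]])

lam = np.sort(np.linalg.eigvalsh(Sigma_XX))[::-1]   # lambda_1 >= ... >= lambda_r
total = lam.sum()                                   # = Trace(Sigma_XX)

for t in range(1, lam.size + 1):
    ratio = lam[t:].sum() / total                   # Eq. 7: unexplained share of variance
    print(t, round(ratio, 4))                       # drops to 0 at t = r
```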

PCA as a variance-maximization method

The main aim of PCA is to reduce the dimension of data. The way a dinosaur would reduce a data vector is to keep one component and eat all the rest. It is, of course, a nonsensical reduction: many pieces of valuable information are lost. Another way is to take a simple average, r^{-1}\sum_{i=1}^{r}X_i. This approach is still undesirable because it treats all components as equally important (equal weights). Another approach is a weighted average with some constraint on the weights, like \sum_{i=1}^{r}w_iX_i such that \sum_{i=1}^{r}w_i^2=1. In matrix notation and using the Euclidean vector norm:

\sum_{i=1}^{r}w_iX_i=w^\text TX \quad \text{ s.t }\quad \sum_{i=1}^{r}w_i^2=w^\text T w=\|w\|^2=1\qquad w_i\in \mathbb{R}

This constrained weighted average is called a standardised linear combination (SLC). However, the problem is how to uniquely and meaningfully determine the weights. PCA chooses the weights so as to maximize the variances of the SLC’s of the variables X_i, i.e. the variances of the linear projections w^\text T X\in\mathbb{R} of the random vector. An SLC is in fact the projection of the vector X onto the direction described by the weight vector. In mathematical notation, we want to solve the following maximization problem:

\max_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)=\max_{\{w:\|w\|=1\}}w^\text T\text{Var}(X)w=\max_{\{w:\|w\|=1\}}w^\text T\Sigma_{XX}w\\ \ \\ w\in \mathbb{R}^{r\times 1},\quad X\in \mathbb{R}^{r\times 1},\quad \Sigma_{XX}\in \mathbb{R}^{r\times r} \ \tag{8a}

In other words,

v=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\tag{8b}

Principal components of X are defined based on this maximization. The 1st PC of X, denoted by \xi_1, is the SLC (of the variables) with the highest variance. It is obtained by the maximization problem expressed by Eq. 8. \text{PC}_1 then has the highest variance among all SLC’s of the variables/projections of X:

v_1=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\qquad v_1\in \mathbb{R}^{r\times 1} \\ \ \\ \text {PC}_1:=\xi_1:=v_1^\text TX\qquad \xi_1\in\mathbb{R}\ \tag{9}

v_1 is called the 1st principal direction (PD) or principal axis.
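The maximization in Eq. 9 can be illustrated numerically: for the illustrative covariance matrix used before, no randomly drawn unit-norm weight vector should produce an SLC variance exceeding \lambda_1=\text{Var}(v_1^\text TX) (a sketch with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) covariance matrix of X
Sigma_XX = np.array([[4.0, 2.0, 0.5],
                     [2.0, 3.0, 0.3],
                     [0.5, 0.3, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma_XX)
v1, lam1 = eigvecs[:, -1], eigvals[-1]       # eigenvector of the largest eigenvalue

# Var(w^T X) = w^T Sigma_XX w for a unit-norm weight vector w
def slc_variance(w):
    w = w / np.linalg.norm(w)
    return w @ Sigma_XX @ w

# No random unit-norm vector should beat the first principal direction (Eq. 9)
random_vars = [slc_variance(rng.normal(size=3)) for _ in range(10_000)]
print(max(random_vars) <= lam1 + 1e-12)      # True
print(slc_variance(v1), lam1)                # equal: the maximum variance is lambda_1
```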

The 2nd PC of X, denoted by \xi_2, is the SLC (of the variables) with the highest variance such that the maximizing weight vector, v_2, is also subject to the orthogonality constraint v_2^\text T v_1=0. In mathematical notation, the 2nd PC is obtained by the following maximization:

v_2=\argmax_{\{w:\|w\|=1 \text{ and }w^\text Tv_1=0\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_2:=\xi_2:=v_2^\text TX\qquad \ \tag{10}

By the same token, the i-th PC and the i-th PD (principal axis) are defined by the following:

v_i=\argmax_{\{w:\|w\|=1\text{ and }w^\text T v_j=0 \text{ for all } j=1,\dots,i-1\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_i:=\xi_i:=v_i^\text TX\qquad \ \tag{11}

It can be proved (see the references) that the maximizing weight vectors are the normalized eigenvectors of \Sigma_{XX} associated with its eigenvalues sorted in the decreasing order, i.e.:

\Sigma_{XX}=V\varLambda V^\text T\ \text{ s.t. } i\lt j \implies \lambda_{i}\ge \lambda_{j}\\ \ \\ \text{i.e. }\lambda_1\ge \lambda_2\ge \dots \ge\lambda_r \ge 0

Therefore:

\text{cov}(\xi_i,\xi_j)=\text {Cov}(v_i^\text TX,v_j^\text TX)=v_i^\text T\Sigma_{XX} v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \ \\ \implies \text{Var}(\xi_1)\ge \text{Var}(\xi_2)\ge\dots\ge\text{Var}(\xi_r)\ge0

which means the PC’s not only have the maximum variances among all possible SLC’s of the variables (all possible projections of the random vector), but are also mutually uncorrelated.

In a matrix form, we can write:

\xi=V^\text TX\qquad \xi,X\in\mathbb{R}^{r\times 1},\ V^\text T\in\mathbb{R}^{r\times r}\tag{12}

where X and \xi are the random vector and its linear projection vector containing the PC’s of X respectively.

It is readily concluded that:

\text{Var}(\xi)=\text{Var}(V^\text TX)=V^\text T\Sigma_{XX}V=\varLambda=\text{diag}(\lambda_1,\dots,\lambda_r)

PCA in practice: Sample PCA

In practice, the PC’s are to be estimated from n independent observations of the random r-vector X of data. This requires estimating the mean and covariance matrix of the random vector of data, using the sample mean and the sample covariance.

Here, we use the notation \bold X_i\in \mathbb{R}^{r\times 1} for the i-th independent trial to observe the random vector X.

Note: we denote an observation vector with a capital letter \bold X_i to indicate that it is still a random vector (out of the same population and through independent trials). To indicate a particular realization/value of \bold X_i we shall use a small letter (by convention).

We collect the sequence of n independent observations of the random vector X in a set as:

\{\bold X_i, i=1,2,\dots,n\}\ \text{ s.t }\bold X_i\in\mathbb{R}^{r\times 1}\tag{13}

The estimate for \mu_X is given by:

\hat \mu_X=\bold{\bar X}:=\frac{1}{n}\sum_{i=1}^{n}\bold X_i\tag{14}

Let \bold X_{ci}=\bold X_i-\bold{\bar X} for the n observations of the random vector, and let \chi_c=[\bold X_{c1},\dots,\bold X_{cn}]\in \mathbb{R}^{r\times n} be the matrix of centered observations. Then, the estimate of the covariance matrix is:

\hat\Sigma_{XX}=\frac{1}{n}\chi_c\chi_c^\text T\tag{15}

The ordered eigenvalues of \hat\Sigma_{XX} and their associated eigenvectors then estimate the corresponding population parameters. This leads to:

\hat\lambda_1\ge\hat\lambda_2\ge\dots\ge\hat\lambda_r\ge 0\\ \ \\ \hat A=(\hat v_1,\dots,\hat v_t)=\hat B^\text T\tag{16}

In practice, the vectors of independent observations are collected as columns (or rows) of a matrix (just re-writing Eq. 13). Let’s call this matrix the matrix of observations. Here, each column represents an observation of X:

\bold X=\begin{bmatrix}|&|&\dots &|\\ \bold X_1 & \bold X_2 &\dots &\bold X_n \\|&|&\dots &| \end{bmatrix}\in \mathbb{R}^{r\times n}\tag{17}
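Here is a minimal NumPy sketch of Eqs. 14–17, assuming (purely for illustration) that the n observations are drawn from a multivariate normal distribution with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# n independent observations of a random r-vector, drawn (for illustration) from a
# multivariate normal with made-up parameters; columns are observations (Eq. 17)
r, n, t = 3, 500, 2
mu_true = [1.0, -2.0, 0.5]
Sigma_true = [[4.0, 2.0, 0.5],
              [2.0, 3.0, 0.3],
              [0.5, 0.3, 1.0]]
X_obs = rng.multivariate_normal(mu_true, Sigma_true, size=n).T    # shape (r, n)

mu_hat = X_obs.mean(axis=1, keepdims=True)        # Eq. 14: sample mean
chi_c = X_obs - mu_hat                            # centered observations X_ci
Sigma_hat = (chi_c @ chi_c.T) / n                 # Eq. 15: sample covariance

eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
lam_hat = eigvals[::-1]                           # Eq. 16: ordered eigenvalue estimates
A_hat = eigvecs[:, ::-1][:, :t]                   # Eq. 16: A_hat = (v_hat_1, ..., v_hat_t)
print(lam_hat)
```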

Recalling Eq. 5 or Eq. 12, PC’s can be calculated for each column of the matrix of observations. In this regard, we obtain estimates of the PC’s of X:

\hat\xi_{ij}^{(t)}=\hat v_j^\text T\bold X_i\ \in\mathbb{R}\\ j=1,\dots,t\ \text{ and }\ i=1,\dots,n

This is the estimate of the j-th PC of X calculated based on the i-th observation \bold X_i. The term \hat\xi_{ij}^{(t)} is called the score for the i-th observation on the j-th PC.

For each observation, we can write the above in a more compact form:

\hat\xi_{i}^{(t)}=\hat A^\text T\bold X_i\ \in\mathbb{R}^{t\times 1}\\ i=1,\dots,n\tag{18}
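And a short sketch of Eq. 18, computing the scores for all n observations at once (reusing the same illustrative setup; the \bold X_i are the columns of the observation matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative setup as before: columns of X_obs are the observations X_i
r, n, t = 3, 500, 2
X_obs = rng.multivariate_normal([1.0, -2.0, 0.5],
                                [[4.0, 2.0, 0.5],
                                 [2.0, 3.0, 0.3],
                                 [0.5, 0.3, 1.0]], size=n).T      # shape (r, n)

Sigma_hat = np.cov(X_obs, bias=True)              # rows are variables; same as Eq. 15
A_hat = np.linalg.eigh(Sigma_hat)[1][:, ::-1][:, :t]

# Eq. 18 for all observations at once: column i holds the t scores of observation i
scores = A_hat.T @ X_obs                          # shape (t, n)
print(scores.shape)
```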

References

– Modern Multivariate Statistical Techniques, A. J. Izenman.
– Applied Multivariate Statistical Analysis, W. Hardle, L. Simar.

Although mostly paraphrased, this post might contain copyrighted sentences of the references.