A Medley of Potpourri

Sunday, March 23, 2025

Principal component analysis

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Principal_component_analysis

PCA of a multivariate Gaussian distribution centered at (1,3) with a standard deviation of 3 in roughly the (0.866, 0.5) direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.

Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.

The data is linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified.

The principal components of a collection of points in a real coordinate space are a sequence of $p$ unit vectors, where the $i$ -th vector is the direction of a line that best fits the data while being orthogonal to the first $i - 1$ vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points.

Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science.

Overview

When performing PCA, the first principal component of a set of $p$ variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through $p$ iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The $i$ -th principal component can be taken as a direction orthogonal to the first $i - 1$ principal components that maximizes the variance of the projected data.

For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed.

History

PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of X^TX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics.

Intuition

PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small.

To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues.
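The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the synthetic two-dimensional data set and all variable names (`B`, `C`, `explained_ratio`, `C_rotated`) are assumptions made for the example, not part of the article.

```python
import numpy as np

# Synthetic 2-D data (an assumption for illustration): a Gaussian cloud
# stretched by a factor of 3 along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 1.0]])

# Center each variable on 0 by subtracting its mean.
B = X - X.mean(axis=0)

# Covariance matrix of the centered data (p x p).
C = np.cov(B, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending
# order and unit-length eigenvectors as columns, so re-sort to descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of the variance represented by each eigenvector.
explained_ratio = eigvals / eigvals.sum()

# In the eigenvector basis the covariance matrix becomes diagonal.
C_rotated = eigvecs.T @ C @ eigvecs
```

Here `np.linalg.eigh` already returns unit-length eigenvectors, so the explicit normalization step described in the text is not repeated.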

Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA.

Details

PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

Consider an $n \times p$ data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor).

Mathematically, the transformation is defined by a set of size $l$ of p-dimensional vectors of weights or coefficients $w_{(k)} = (w_{1}, \dots, w_{p})_{(k)}$ that map each row vector $x_{(i)} = (x_{1}, \dots, x_{p})_{(i)}$ of X to a new vector of principal component scores $t_{(i)} = (t_{1}, \dots, t_{l})_{(i)}$ , given by

{t_{k}}_{(i)} = x_{(i)} \cdot w_{(k)} \qquad \text{for} \qquad i = 1, \dots, n \qquad k = 1, \dots, l

in such a way that the individual variables $t_{1}, \dots, t_{l}$ of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector (where $l$ is usually selected to be strictly less than $p$ to reduce dimensionality).

The above may equivalently be written in matrix form as

T = X W

where $T_{i k} = {t_{k}}_{(i)}$ , $X_{i j} = {x_{j}}_{(i)}$ , and $W_{j k} = {w_{j}}_{(k)}$ .

First component

In order to maximize variance, the first weight vector w₍₁₎ thus has to satisfy

w_{(1)} = \arg\max_{‖ w ‖ = 1} {\sum_{i} (t_{1})_{(i)}^{2}} = \arg\max_{‖ w ‖ = 1} {\sum_{i} {(x_{(i)} \cdot w)}^{2}}

Equivalently, writing this in matrix form gives

w_{(1)} = \arg\max_{‖ w ‖ = 1} {{‖ X w ‖}^{2}} = \arg\max_{‖ w ‖ = 1} {w^{T} X^{T} X w}

Since w₍₁₎ has been defined to be a unit vector, it equivalently also satisfies

w_{(1)} = \arg\max {\frac{w^{T} X^{T} X w}{w^{T} w}}

The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as X^TX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector.

With w₍₁₎ found, the first principal component of a data vector x_(i) can then be given as a score t_1(i) = x_(i) ⋅ w₍₁₎ in the transformed co-ordinates, or as the corresponding vector in the original variables, {x_(i) ⋅ w₍₁₎} w₍₁₎.
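As a rough numerical check of the Rayleigh-quotient argument, the sketch below compares the quotient at the leading eigenvector of X^TX with its value at many random directions. The random test data and the helper name `rayleigh` are assumptions introduced for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)          # column-wise zero empirical mean


def rayleigh(w, X):
    """Rayleigh quotient w^T X^T X w / (w^T w)."""
    return (w @ X.T @ X @ w) / (w @ w)


# Eigendecomposition of the symmetric matrix X^T X (ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
w1 = eigvecs[:, -1]             # eigenvector of the largest eigenvalue
lam1 = eigvals[-1]

# Scores of every row of X on the first principal component.
t1 = X @ w1

# No randomly drawn direction should exceed the quotient attained at w1.
random_best = max(rayleigh(w, X) for w in rng.normal(size=(1000, 4)))
```

The sum of squared scores `t1 @ t1` should agree with the largest eigenvalue `lam1`, up to floating-point error.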

Further components

The k-th component can be found by subtracting the first k − 1 principal components from X:

{\hat{X}}_{k} = X - \sum_{s = 1}^{k - 1} X w_{(s)} w_{(s)}^{T}

and then finding the weight vector which extracts the maximum variance from this new data matrix

w_{(k)} = \underset{‖ w ‖ = 1}{\arg\max} {{‖ {\hat{X}}_{k} w ‖}^{2}} = \arg\max {\frac{w^{T} {\hat{X}}_{k}^{T} {\hat{X}}_{k} w}{w^{T} w}}

It turns out that this gives the remaining eigenvectors of X^TX, with the maximum values for the quantity in brackets given by their corresponding eigenvalues. Thus the weight vectors are eigenvectors of X^TX.

The k-th principal component of a data vector x_(i) can therefore be given as a score t_k(i) = x_(i) ⋅ w_(k) in the transformed coordinates, or as the corresponding vector in the space of the original variables, {x_(i) ⋅ w_(k)} w_(k), where w_(k) is the kth eigenvector of X^TX.
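The deflation procedure can be illustrated with a short sketch that finds the components one at a time and compares them with the eigenvectors of X^TX. The random data and the names `top_direction`, `W_deflation` and `W_eig` are assumptions for this example; the comparison is made up to sign, since an eigenvector is only determined up to its sign.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
n, p = X.shape


def top_direction(M):
    """Unit vector w maximizing ||M w||^2 (top eigenvector of M^T M)."""
    _, vecs = np.linalg.eigh(M.T @ M)
    return vecs[:, -1]


W_deflation = np.zeros((p, 0))
for k in range(p):
    # X_hat_k = X - sum_s X w_s w_s^T over the components found so far.
    X_hat = X - X @ W_deflation @ W_deflation.T
    w = top_direction(X_hat)
    W_deflation = np.column_stack([W_deflation, w])

# Reference: all eigenvectors of X^T X, sorted by decreasing eigenvalue.
_, vecs = np.linalg.eigh(X.T @ X)
W_eig = vecs[:, ::-1]

# Absolute cosine between matching columns (1 means same direction).
alignment = np.abs(np.sum(W_deflation * W_eig, axis=0))
```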

The full principal components decomposition of X can therefore be given as

T = X W

where W is a p-by-p matrix of weights whose columns are the eigenvectors of X^TX. The transpose of W is sometimes called the whitening or sphering transformation. Columns of W multiplied by the square root of the corresponding eigenvalues, that is, eigenvectors scaled by quantities proportional to the standard deviations of the components, are called loadings in PCA or in factor analysis.

Covariances

X^TX itself can be recognized as proportional to the empirical sample covariance matrix of the dataset X^T.

The sample covariance Q between two of the different principal components over the dataset is given by:

\begin{aligned} Q ({P C}_{(j)}, {P C}_{(k)}) & \propto (X w_{(j)})^{T} (X w_{(k)}) \\ & = w_{(j)}^{T} X^{T} X w_{(k)} \\ & = w_{(j)}^{T} λ_{(k)} w_{(k)} \\ & = λ_{(k)} w_{(j)}^{T} w_{(k)} \end{aligned}

where the eigenvalue property of w_(k) has been used to move from line 2 to line 3. However eigenvectors w_(j) and w_(k) corresponding to eigenvalues of a symmetric matrix are orthogonal (if the eigenvalues are different), or can be orthogonalised (if the vectors happen to share an equal repeated value). The product in the final line is therefore zero; there is no sample covariance between different principal components over the dataset.

Another way to characterise the principal components transformation is therefore as the transformation to coordinates which diagonalise the empirical sample covariance matrix.

In matrix form, the empirical covariance matrix for the original variables can be written

Q \propto X^{T} X = W Λ W^{T}

The empirical covariance matrix between the principal components becomes

W^{T} Q W \propto W^{T} W Λ W^{T} W = Λ

where Λ is the diagonal matrix of eigenvalues λ_(k) of X^TX. λ_(k) is equal to the sum of the squares over the dataset associated with each component k, that is, λ_(k) = Σ_i t_k²_(i) = Σ_i (x_(i) ⋅ w_(k))².
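The diagonalization can be checked numerically. In this sketch (random data and variable names are assumptions for illustration), the Gram matrix of the scores should match the diagonal matrix of eigenvalues, and the sample covariance between different score columns should vanish up to rounding.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)

lam, W = np.linalg.eigh(X.T @ X)
lam, W = lam[::-1], W[:, ::-1]      # decreasing eigenvalue order

T = X @ W                           # principal component scores

# T^T T = W^T X^T X W should equal the diagonal matrix Lambda.
gram_T = T.T @ T

# Sample covariance between different score columns (expected to be ~0).
cov_T = np.cov(T, rowvar=False)
off_diagonal = cov_T - np.diag(np.diag(cov_T))

# lambda_k as the sum of squared scores of component k.
sum_sq_scores = np.sum(T ** 2, axis=0)
```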

Dimensionality reduction

The transformation T = X W maps a data vector x_(i) from an original space of p variables to a new space of p variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first L principal components, produced by using only the first L eigenvectors, gives the truncated transformation

T_{L} = X W_{L}

where the matrix T_L now has n rows but only L columns. In other words, PCA learns a linear transformation $t = W_{L}^{T} x, x \in R^{p}, t \in R^{L},$ where the columns of $p \times L$ matrix $W_{L}$ form an orthogonal basis for the L features (the components of representation t) that are decorrelated. By construction, of all the transformed data matrices with only L columns, this score matrix maximises the variance in the original data that has been preserved, while minimising the total squared reconstruction error $‖ T W^{T} - T_{L} W_{L}^{T} ‖_{2}^{2}$ or $‖ X - X_{L} ‖_{2}^{2}$ .
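A minimal sketch of the truncated transformation follows. It assumes random test data and reads the reconstruction error as a squared Frobenius norm (the sum of squared entries of X − X_L); under that reading the error should equal the sum of the discarded eigenvalues of X^TX. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(250, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)

lam, W = np.linalg.eigh(X.T @ X)
lam, W = lam[::-1], W[:, ::-1]      # decreasing eigenvalue order

L = 2
W_L = W[:, :L]              # p x L matrix of the first L eigenvectors
T_L = X @ W_L               # n x L truncated score matrix
X_L = T_L @ W_L.T           # rank-L reconstruction in the original space

# Total squared reconstruction error (squared Frobenius norm).
error = np.sum((X - X_L) ** 2)

# Fraction of the total sum of squares retained by the first L components.
retained = lam[:L].sum() / lam.sum()
```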

Such dimensionality reduction can be a very useful step for visualising and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting L = 2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains clusters these too may be most spread out, and therefore most visible to be plotted out in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.

Similarly, in regression analysis, the larger the number of explanatory variables allowed, the greater is the chance of overfitting the model, producing conclusions that fail to generalise to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called principal component regression.

Dimensionality reduction may also be appropriate when the variables in a dataset are noisy. If each column of the dataset contains independent identically distributed Gaussian noise, then the columns of T will also contain similarly identically distributed Gaussian noise (such a distribution is invariant under the effects of the matrix W, which can be thought of as a high-dimensional rotation of the co-ordinate axes). However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is less—the first few components achieve a higher signal-to-noise ratio. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can usefully be captured by dimensionality reduction; while the later principal components may be dominated by noise, and so disposed of without great loss. If the dataset is not too large, the significance of the principal components can be tested using parametric bootstrap, as an aid in determining how many principal components to retain.

Singular value decomposition

The principal components transformation can also be associated with another matrix factorization, the singular value decomposition (SVD) of X,

X = U Σ W^{T}

Here Σ is an n-by-p rectangular diagonal matrix of positive numbers σ_(k), called the singular values of X; U is an n-by-n matrix, the columns of which are orthogonal unit vectors of length n called the left singular vectors of X; and W is a p-by-p matrix whose columns are orthogonal unit vectors of length p and called the right singular vectors of X.

In terms of this factorization, the matrix X^TX can be written

\begin{aligned} X^{T} X & = W Σ^{T} U^{T} U Σ W^{T} \\ & = W Σ^{T} Σ W^{T} \\ & = W {\hat{Σ}}^{2} W^{T} \end{aligned}

where $\hat{Σ}$ is the square diagonal matrix with the singular values of X and the excess zeros chopped off that satisfies ${\hat{Σ}}^{2} = Σ^{T} Σ$ . Comparison with the eigenvector factorization of X^TX establishes that the right singular vectors W of X are equivalent to the eigenvectors of X^TX, while the singular values σ_(k) of $X$ are equal to the square-root of the eigenvalues λ_(k) of X^TX.

Using the singular value decomposition the score matrix T can be written

\begin{aligned} T & = X W \\ & = U Σ W^{T} W \\ & = U Σ \end{aligned}

so each column of T is given by one of the left singular vectors of X multiplied by the corresponding singular value. This form is also the polar decomposition of T.

Efficient algorithms exist to calculate the SVD of X without having to form the matrix X^TX, so computing the SVD is now the standard way to calculate a principal components analysis from a data matrix, unless only a handful of components are required.

As with the eigen-decomposition, a truncated $n \times L$ score matrix T_L can be obtained by considering only the first L largest singular values and their singular vectors:

T_{L} = U_{L} Σ_{L} = X W_{L}

The truncation of a matrix M or T using a truncated singular value decomposition in this way produces a truncated matrix that is the nearest possible matrix of rank L to the original matrix, in the sense of the difference between the two having the smallest possible Frobenius norm, a result known as the Eckart–Young theorem [1936].
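The relationship between the SVD and the eigendecomposition can be sketched with `numpy.linalg.svd`. The random data and variable names are assumptions for illustration; the thin SVD (`full_matrices=False`) is used so that U has only p columns.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

# Thin SVD: U is n x p, s holds the singular values in decreasing order,
# and Wt is W^T (p x p).
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

T_svd = U * s               # same as U @ diag(s)
T_proj = X @ W              # scores computed by projection

# Eigenvalues of X^T X in decreasing order, for comparison with s**2.
lam = np.linalg.eigvalsh(X.T @ X)[::-1]

# Truncation to the first L singular values and vectors.
L = 2
T_L = U[:, :L] * s[:L]
X_L = T_L @ Wt[:L, :]       # rank-L approximation of X
truncation_error = np.sum((X - X_L) ** 2)
```

The squared Frobenius error of the rank-L approximation should equal the sum of the squared discarded singular values, which is the minimum guaranteed by the Eckart–Young theorem.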

Further considerations

The singular values (in Σ) are the square roots of the eigenvalues of the matrix X^TX. Each eigenvalue is proportional to the portion of the "variance" (more correctly of the sum of the squared distances of the points from their multidimensional mean) that is associated with each eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean. PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible (using an orthogonal transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information (see below). PCA is often used in this manner for dimensionality reduction. PCA has the distinction of being the optimal orthogonal transformation for keeping the subspace that has largest "variance" (as defined above). This advantage, however, comes at the price of greater computational requirements if compared, for example, and when applicable, to the discrete cosine transform, and in particular to the DCT-II which is simply known as the "DCT". Nonlinear dimensionality reduction techniques tend to be more computationally demanding than PCA.

PCA is sensitive to the scaling of the variables. If we have just two variables and they have the same sample variance and are completely correlated, then the PCA will entail a rotation by 45° and the "weights" (they are the cosines of rotation) for the two variables with respect to the principal component will be equal. But if we multiply all values of the first variable by 100, then the first principal component will be almost the same as that variable, with a small contribution from the other variable, whereas the second component will be almost aligned with the second original variable. This means that whenever the different variables have different units (like temperature and mass), PCA is a somewhat arbitrary method of analysis. (Different results would be obtained if one used Fahrenheit rather than Celsius for example.) Pearson's original paper was entitled "On Lines and Planes of Closest Fit to Systems of Points in Space" – "in space" implies physical Euclidean space where such concerns do not arise. One way of making the PCA less arbitrary is to use variables scaled so as to have unit variance, by standardizing the data and hence use the autocorrelation matrix instead of the autocovariance matrix as a basis for PCA. However, this compresses (or expands) the fluctuations in all dimensions of the signal space to unit variance.
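The two-variable example can be reproduced approximately in code. This sketch assumes two strongly (not perfectly) correlated synthetic variables; the helper `first_pc` and all other names are hypothetical.

```python
import numpy as np


def first_pc(data):
    """Unit-length first principal direction of the columns of `data`."""
    centered = data - data.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return vecs[:, -1]


rng = np.random.default_rng(6)
a = rng.normal(size=1000)
# Two variables with nearly equal variance and strong correlation.
data = np.column_stack([a + 0.1 * rng.normal(size=1000),
                        a + 0.1 * rng.normal(size=1000)])

# Weights of roughly equal magnitude (a rotation by about 45 degrees).
w_equal = first_pc(data)

# Express the first variable in different units (multiply by 100).
scaled = data * np.array([100.0, 1.0])
w_scaled = first_pc(scaled)      # almost aligned with the first variable

# Standardizing to unit variance removes the dependence on units.
z = (scaled - scaled.mean(axis=0)) / scaled.std(axis=0, ddof=1)
w_standardized = first_pc(z)
```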

Mean subtraction (a.k.a. "mean centering") is necessary for performing classical PCA to ensure that the first principal component describes the direction of maximum variance. If mean subtraction is not performed, the first principal component might instead correspond more or less to the mean of the data. A mean of zero is needed for finding a basis that minimizes the mean square error of the approximation of the data.
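The effect of skipping mean subtraction can be seen in a small experiment. The sketch below assumes a synthetic cloud that is elongated along the first axis but located far from the origin; the names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(8)
# Cloud elongated along the first axis, shifted far from the origin.
cloud = rng.normal(size=(500, 2)) * np.array([2.0, 0.5])
X = cloud + np.array([0.0, 50.0])


def top_right_singular_vector(M):
    """First right singular vector of M (leading direction of M^T M)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]


w_uncentered = top_right_singular_vector(X)
w_centered = top_right_singular_vector(X - X.mean(axis=0))

# Unit vector pointing from the origin to the mean of the data.
mean_direction = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
```

Without centering, the leading direction follows the mean of the data; with centering, it follows the axis of largest variance.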

Mean-centering is unnecessary if performing a principal components analysis on a correlation matrix, as the data are already centered after calculating correlations. Correlations are derived from the cross-product of two standard scores (Z-scores) or statistical moments (hence the name: Pearson Product-Moment Correlation). Also see the article by Kromrey & Foster-Johnson (1998) on "Mean-centering in Moderated Regression: Much Ado About Nothing". Since correlations are covariances of normalized variables (Z- or standard scores), a PCA based on the correlation matrix of X is equal to a PCA based on the covariance matrix of Z, the standardized version of X.
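The equivalence between a correlation-matrix PCA of X and a covariance-matrix PCA of the standardized data Z can be checked directly. The random data (deliberately left uncentered) and the variable names are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3)) + 5.0

# Correlation matrix of the raw, uncentered data.
R = np.corrcoef(X, rowvar=False)

# Standardized data Z (z-scores) and its covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C_Z = np.cov(Z, rowvar=False)

# Eigenvalues (ascending) and eigenvectors of both matrices.
vals_R, vecs_R = np.linalg.eigh(R)
vals_Z, vecs_Z = np.linalg.eigh(C_Z)
```

Because R and C_Z agree up to rounding, their eigenvalues and eigenvectors agree as well, and the eigenvalues sum to the number of variables.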

PCA is a popular primary technique in pattern recognition. It is not, however, optimized for class separability. Nevertheless, it has been used to quantify the distance between two or more classes by calculating the center of mass for each class in principal component space and reporting the Euclidean distance between the centers of mass of two or more classes. Linear discriminant analysis is an alternative which is optimized for class separability.

Table of symbols and abbreviations

Symbol	Meaning	Dimensions	Indices
$X = [X_{i j}]$	data matrix, consisting of the set of all data vectors, one vector per row	$n \times p$	$i = 1 \dots n$ $j = 1 \dots p$
$n$	the number of row vectors in the data set	$1 \times 1$	scalar
$p$	the number of elements in each row vector (dimension)	$1 \times 1$	scalar
$L$	the number of dimensions in the dimensionally reduced subspace, $1 \leq L \leq p$	$1 \times 1$	scalar
$u = [u_{j}]$	vector of empirical means, one mean for each column j of the data matrix	$p \times 1$	$j = 1 \dots p$
$s = [s_{j}]$	vector of empirical standard deviations, one standard deviation for each column j of the data matrix	$p \times 1$	$j = 1 \dots p$
$h = [h_{i}]$	vector of all 1's	$n \times 1$	$i = 1 \dots n$
$B = [B_{i j}]$	deviations from the mean of each column j of the data matrix	$n \times p$	$i = 1 \dots n$ $j = 1 \dots p$
$Z = [Z_{i j}]$	z-scores, computed using the mean and standard deviation for each column j of the data matrix	$n \times p$	$i = 1 \dots n$ $j = 1 \dots p$
$C = [C_{j j^{'}}]$	covariance matrix	$p \times p$	$j = 1 \dots p$ $j^{'} = 1 \dots p$
$R = [R_{j j^{'}}]$	correlation matrix	$p \times p$	$j = 1 \dots p$ $j^{'} = 1 \dots p$
$V = [V_{j j^{'}}]$	matrix consisting of the set of all eigenvectors of C, one eigenvector per column	$p \times p$	$j = 1 \dots p$ $j^{'} = 1 \dots p$
$D = [D_{j j^{'}}]$	diagonal matrix consisting of the set of all eigenvalues of C along its principal diagonal, and 0 for all other elements ( note $Λ$ used above )	$p \times p$	$j = 1 \dots p$ $j^{'} = 1 \dots p$
$W = [W_{j l}]$	matrix of basis vectors, one vector per column, where each basis vector is one of the eigenvectors of C, and where the vectors in W are a sub-set of those in V	$p \times L$	$j = 1 \dots p$ $l = 1 \dots L$
$T = [T_{i l}]$	matrix consisting of n row vectors, where each vector is the projection of the corresponding data vector from matrix X onto the basis vectors contained in the columns of matrix W.	$n \times L$	$i = 1 \dots n$ $l = 1 \dots L$

Properties and limitations

Properties

Some properties of PCA include:

Property 1: For any integer q, 1 ≤ q ≤ p, consider the orthogonal linear transformation

y = B^{'} x

where $y$ is a q-element vector and $B^{'}$ is a (q × p) matrix, and let $Σ_{y} = B^{'} Σ B$ be the variance-covariance matrix for $y$. Then the trace of $Σ_{y}$, denoted $\operatorname{tr} (Σ_{y})$, is maximized by taking $B = A_{q}$, where $A_{q}$ consists of the first q columns of $A$ ($B^{'}$ is the transpose of $B$). ($A$ is not defined here)

Property 2: Consider again the orthonormal transformation

y = B^{'} x

with $x$, $B$, $A$ and $Σ_{y}$ defined as before. Then $\operatorname{tr} (Σ_{y})$ is minimized by taking $B = A_{q}^{*}$, where $A_{q}^{*}$ consists of the last q columns of $A$.

The statistical implication of this property is that the last few PCs are not simply unstructured left-overs after removing the important PCs. Because these last PCs have variances as small as possible they are useful in their own right. They can help to detect unsuspected near-constant linear relationships between the elements of $x$ , and they may also be useful in regression, in selecting a subset of variables from $x$ , and in outlier detection.
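Properties 1 and 2 can be probed numerically. The sketch below rests on an assumption the text does not state: that $A$ is the matrix whose columns are the eigenvectors of $Σ$, ordered by decreasing eigenvalue. Under that assumption it compares the trace obtained from the first and last q columns of $A$ with the trace obtained from random orthonormal p × q matrices. The data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
data = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
Sigma = np.cov(data, rowvar=False)

# ASSUMPTION: A holds the eigenvectors of Sigma, decreasing eigenvalue order.
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]

q = 2


def trace_for(B):
    """tr(B' Sigma B) for a p x q matrix B with orthonormal columns."""
    return np.trace(B.T @ Sigma @ B)


tr_first = trace_for(A[:, :q])      # candidate maximizer (Property 1)
tr_last = trace_for(A[:, -q:])      # candidate minimizer (Property 2)

# Random p x q matrices with orthonormal columns, obtained via QR.
random_traces = []
for _ in range(500):
    Q, _ = np.linalg.qr(rng.normal(size=(5, q)))
    random_traces.append(trace_for(Q))
```

A random search like this cannot prove optimality; it only shows that none of the sampled matrices beats the eigenvector choice.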

Property 3: (Spectral decomposition of $Σ$)

Σ = λ_{1} α_{1} α_{1}^{'} + \dots + λ_{p} α_{p} α_{p}^{'}

Before we look at its usage, we first look at diagonal elements,

\operatorname{Var} (x_{j}) = \sum_{k = 1}^{p} λ_{k} α_{k j}^{2}

Then, perhaps the main statistical implication of the result is that not only can we decompose the combined variances of all the elements of $x$ into decreasing contributions due to each PC, but we can also decompose the whole covariance matrix into contributions $λ_{k} α_{k} α_{k}^{'}$ from each PC. Although not strictly decreasing, the elements of $λ_{k} α_{k} α_{k}^{'}$ will tend to become smaller as $k$ increases, as $λ_{k}$ is nonincreasing for increasing $k$ , whereas the elements of $α_{k}$ tend to stay about the same size because of the normalization constraints: $α_{k}^{'} α_{k} = 1, k = 1, \dots, p$ .
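The spectral decomposition and the formula for the diagonal elements can be verified on a sample covariance matrix. The random data and the variable names are assumptions for this sketch; column k of `A` plays the role of $α_{k}$.

```python
import numpy as np

rng = np.random.default_rng(9)
data = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Sigma = np.cov(data, rowvar=False)

lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]          # decreasing eigenvalues

# One rank-1 contribution lam_k * alpha_k alpha_k' per component.
contributions = [lam[k] * np.outer(A[:, k], A[:, k]) for k in range(len(lam))]
Sigma_rebuilt = sum(contributions)

# Diagonal elements: Var(x_j) = sum_k lam_k * alpha_kj^2,
# where alpha_kj is stored as A[j, k].
variances = (A ** 2) @ lam
```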

Limitations

As noted above, the results of PCA depend on the scaling of the variables. This can be cured by scaling each feature by its standard deviation, so that one ends up with dimensionless features with unit variance.

The applicability of PCA as described above is limited by certain (tacit) assumptions made in its derivation. In particular, PCA can capture linear correlations between the features but fails when this assumption is violated (see Figure 6a in the reference). In some cases, coordinate transformations can restore the linearity assumption and PCA can then be applied (see kernel PCA).

Another limitation is the mean-removal process before constructing the covariance matrix for PCA. In fields such as astronomy, all the signals are non-negative, and the mean-removal process will force the mean of some astrophysical exposures to be zero, which consequently creates unphysical negative fluxes, and forward modeling has to be performed to recover the true magnitude of the signals. As an alternative method, non-negative matrix factorization focusing only on the non-negative elements in the matrices is well-suited for astrophysical observations. See more at the relation between PCA and non-negative matrix factorization.

PCA is at a disadvantage if the data has not been standardized before applying the algorithm to it. PCA transforms the original data into data that is relevant to the principal components of that data, which means that the new data variables cannot be interpreted in the same ways that the originals were. They are linear interpretations of the original variables. Also, if PCA is not performed properly, there is a high likelihood of information loss.

PCA relies on a linear model. If a dataset has a pattern hidden inside it that is nonlinear, then PCA can actually steer the analysis in the complete opposite direction of progress. Researchers at Kansas State University discovered that the sampling error in their experiments impacted the bias of PCA results. "If the number of subjects or blocks is smaller than 30, and/or the researcher is interested in PC's beyond the first, it may be better to first correct for the serial correlation, before PCA is conducted". The researchers at Kansas State also found that PCA could be "seriously biased if the autocorrelation structure of the data is not correctly handled".

PCA and information theory

Dimensionality reduction results in a loss of information, in general. PCA-based dimensionality reduction tends to minimize that information loss, under certain signal and noise models.

Under the assumption that

x = s + n,

that is, that the data vector $x$ is the sum of the desired information-bearing signal $s$ and a noise signal $n$, one can show that PCA can be optimal for dimensionality reduction, from an information-theoretic point of view.

In particular, Linsker showed that if $s$ is Gaussian and $n$ is Gaussian noise with a covariance matrix proportional to the identity matrix, the PCA maximizes the mutual information $I (y; s)$ between the desired information $s$ and the dimensionality-reduced output $y = W_{L}^{T} x$ .

If the noise is still Gaussian and has a covariance matrix proportional to the identity matrix (that is, the components of the vector $n$ are iid), but the information-bearing signal $s$ is non-Gaussian (which is a common scenario), PCA at least minimizes an upper bound on the information loss, which is defined as

I (x; s) - I (y; s) .

The optimality of PCA is also preserved if the noise $n$ is iid and at least more Gaussian (in terms of the Kullback–Leibler divergence) than the information-bearing signal $s$ . In general, even if the above signal model holds, PCA loses its information-theoretic optimality as soon as the noise $n$ becomes dependent.

Computation using the covariance method

The following is a detailed description of PCA using the covariance method as opposed to the correlation method.

The goal is to transform a given data set X of dimension p to an alternative data set Y of smaller dimension L. Equivalently, we are seeking to find the matrix Y, where Y is the Karhunen–Loève transform (KLT) of matrix X:

$Y = \operatorname{KLT} \{X\}$

Organize the data set

Suppose you have data comprising a set of observations of p variables, and you want to reduce the data so that each observation can be described with only L variables, L < p. Suppose further, that the data are arranged as a set of n data vectors $x_{1} \dots x_{n}$ with each $x_{i}$ representing a single grouped observation of the p variables.
- Write $x_{1} \dots x_{n}$ as row vectors, each with p elements.
- Place the row vectors into a single matrix X of dimensions n × p.

Calculate the empirical mean

- Find the empirical mean along each column j = 1, ..., p.
- Place the calculated mean values into an empirical mean vector u of dimensions p × 1. $u_{j} = \frac{1}{n} \sum_{i = 1}^{n} X_{i j}$

Calculate the deviations from the mean

Mean subtraction is an integral part of the solution towards finding a principal component basis that minimizes the mean square error of approximating the data. Hence we proceed by centering the data as follows:
- Subtract the empirical mean vector $u^{T}$ from each row of the data matrix X.
- Store mean-subtracted data in the n × p matrix B. $B = X - h u^{T}$ where h is an $n \times 1$ column vector of all 1s: $h_{i} = 1 \text{ for } i = 1, \dots, n$

In some applications, each variable (column of B) may also be scaled to have a variance equal to 1 (see Z-score). This step affects the calculated principal components, but makes them independent of the units used to measure the different variables.

Find the covariance matrix

- Find the p × p empirical covariance matrix C from matrix B: $C = \frac{1}{n - 1} B^{*} B$ where $*$ is the conjugate transpose operator. If B consists entirely of real numbers, which is the case in many applications, the "conjugate transpose" is the same as the regular transpose.
- The reasoning behind using $n - 1$ instead of n to calculate the covariance is Bessel's correction.

Find the eigenvectors and eigenvalues of the covariance matrix

- Compute the matrix V of eigenvectors which diagonalizes the covariance matrix C: $V^{- 1} C V = D$ where D is the diagonal matrix of eigenvalues of C. This step will typically involve the use of a computer-based algorithm for computing eigenvectors and eigenvalues. These algorithms are readily available as sub-components of most matrix algebra systems, such as SAS, R, MATLAB, Mathematica, SciPy, IDL (Interactive Data Language), or GNU Octave as well as OpenCV.
- Matrix D will take the form of a p × p diagonal matrix, where $D_{k ℓ} = λ_{k} \text{ for } k = ℓ$ is the kth eigenvalue of the covariance matrix C, and $D_{k ℓ} = 0 \text{ for } k \neq ℓ .$
- Matrix V, also of dimension p × p, contains p column vectors, each of length p, which represent the p eigenvectors of the covariance matrix C.
- The eigenvalues and eigenvectors are ordered and paired. The jth eigenvalue corresponds to the jth eigenvector.
- Matrix V denotes the matrix of right eigenvectors (as opposed to left eigenvectors). In general, the matrix of right eigenvectors need not be the (conjugate) transpose of the matrix of left eigenvectors.
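
A possible way to carry out this step, assuming NumPy and synthetic data, is `numpy.linalg.eigh`, which is intended for symmetric (Hermitian) matrices such as C. The sketch also verifies the relation $V^{-1} C V = D$:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: 100 observations of 3 variables.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.2]])
C = np.cov(X, rowvar=False)

# eigh returns real eigenvalues in ascending order and the matching
# orthonormal eigenvectors as the columns of V.
eigvals, V = np.linalg.eigh(C)
D = np.diag(eigvals)

D_check = np.linalg.inv(V) @ C @ V    # should reproduce D
```

Because C is symmetric, V is orthogonal here, so `V.T` could replace `np.linalg.inv(V)`.
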
Rearrange the eigenvectors and eigenvalues
- Sort the columns of the eigenvector matrix V and eigenvalue matrix D in order of decreasing eigenvalue.
- Make sure to maintain the correct pairings between the columns in each matrix.
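
One way to keep the pairings intact is to apply a single permutation to both the eigenvalues and the eigenvector columns. A short sketch, assuming NumPy and a made-up covariance matrix:

```python
import numpy as np

C = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])       # hypothetical covariance matrix
eigvals, V = np.linalg.eigh(C)        # eigenvalues come back in ascending order

order = np.argsort(eigvals)[::-1]     # indices for decreasing eigenvalue
eigvals = eigvals[order]
V = V[:, order]                       # same permutation keeps the pairs aligned
```
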
Compute the cumulative energy content for each eigenvector
- The eigenvalues represent the distribution of the source data's energy among each of the eigenvectors, where the eigenvectors form a basis for the data. The cumulative energy content g for the jth eigenvector is the sum of the energy content across all of the eigenvalues from 1 through j: $g_{j} = \sum_{k = 1}^{j} D_{k k}$ for $j = 1, \dots, p$.
Select a subset of the eigenvectors as basis vectors
- Save the first L columns of V as the p × L matrix W: $W_{k ℓ} = V_{k ℓ}$ for $k = 1, \dots, p$ and $ℓ = 1, \dots, L$, where $1 \leq L \leq p$.
- Use the vector g as a guide in choosing an appropriate value for L. The goal is to choose a value of L as small as possible while achieving a reasonably high value of g on a percentage basis. For example, you may want to choose L so that the cumulative energy g is above a certain threshold, like 90 percent. In this case, choose the smallest value of L such that $\frac{g_{L}}{g_{p}} \geq 0.9$
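
The cumulative energy and the 90 percent rule can be illustrated with a few lines of NumPy. The eigenvalues below are invented for the example and are assumed to be sorted in decreasing order already:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order.
eigvals = np.array([6.0, 2.5, 1.0, 0.3, 0.2])

g = np.cumsum(eigvals)                # g_j = sum of the first j eigenvalues
ratio = g / g[-1]                     # g_j / g_p, cumulative fraction of energy

threshold = 0.9
# Position of the first ratio >= threshold; +1 turns the index into a count.
L = int(np.searchsorted(ratio, threshold) + 1)
```

Here the cumulative fractions are roughly 0.60, 0.85, 0.95, 0.98 and 1.00, so the smallest L that reaches the threshold is 3.
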
Project the data onto the new basis
- The projected data points are the rows of the matrix $T = B \cdot W$
That is, the first column of $T$ is the projection of the data points onto the first principal component, the second column is the projection onto the second principal component, etc.
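
Putting the steps together, a minimal end-to-end sketch (assuming NumPy, synthetic data, and an arbitrary choice of L = 2) might look like this. The projected columns should be uncorrelated, with variances equal to the leading eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data with very different spreads along the three axes.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
B = X - X.mean(axis=0)                # centered data

C = np.cov(B, rowvar=False)
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]     # decreasing eigenvalue
eigvals, V = eigvals[order], V[:, order]

L = 2
W = V[:, :L]                          # p x L basis of leading eigenvectors
T = B @ W                             # n x L matrix of projected data

cov_T = np.cov(T, rowvar=False)       # expected to be diag(eigvals[:L])
```
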

Derivation using the covariance method

Let X be a d-dimensional random vector expressed as a column vector. Without loss of generality, assume X has zero mean.

We want to find $(*)$ a $d \times d$ orthonormal transformation matrix P so that PX has a diagonal covariance matrix (that is, PX is a random vector with all its distinct components pairwise uncorrelated).

A quick computation assuming $P$ were unitary yields:

\begin{aligned} \operatorname{cov}(P X) &= \operatorname{E}[P X (P X)^{*}] \\ &= \operatorname{E}[P X X^{*} P^{*}] \\ &= P \operatorname{E}[X X^{*}] P^{*} \\ &= P \operatorname{cov}(X) P^{- 1} \end{aligned}

Hence $(*)$ holds if and only if $cov (X)$ is diagonalisable by $P$.

This is very constructive, as cov(X) is guaranteed to be a non-negative definite matrix and thus is guaranteed to be diagonalisable by some unitary matrix.
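
The derivation can be checked numerically in the real case. The sketch below assumes NumPy and a randomly built symmetric positive semi-definite matrix standing in for $cov (X)$; P is taken as the transpose of the eigenvector matrix, which is one valid choice of unitary (here orthogonal) matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T                      # hypothetical covariance matrix (symmetric PSD)

eigvals, V = np.linalg.eigh(Sigma)
P = V.T                              # rows of P are orthonormal eigenvectors

# cov(PX) = P cov(X) P^{-1}, and P^{-1} = P^T because P is orthogonal.
cov_PX = P @ Sigma @ P.T
```

The off-diagonal entries of `cov_PX` vanish up to rounding, and its diagonal holds the eigenvalues of `Sigma`.
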

Covariance-free computation

In practical implementations, especially with high dimensional data (large $p$), the naive covariance method is rarely used because it is not efficient due to the high computational and memory costs of explicitly determining the covariance matrix. The covariance-free approach avoids the $np^{2}$ operations of explicitly calculating and storing the covariance matrix $X^{T} X$, instead utilizing one of the matrix-free methods, for example, based on the function evaluating the product $X^{T} (X r)$ at the cost of $2np$ operations.
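
The idea can be sketched as follows, assuming NumPy. The function name `gram_matvec` and the data sizes are made up for illustration. The explicit $p \times p$ product is formed only to check the result, and is exactly what the matrix-free version avoids:

```python
import numpy as np

def gram_matvec(X, r):
    """Return X^T (X r) without forming the p x p matrix X^T X."""
    return X.T @ (X @ r)

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 500))       # hypothetical: n = 20 rows, p = 500 columns
X -= X.mean(axis=0)                  # zero-mean columns
r = rng.normal(size=500)

y = gram_matvec(X, r)                # two matrix-vector products, O(p) extra memory
y_explicit = (X.T @ X) @ r           # forms a 500 x 500 matrix first
```
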

Iterative computation

One way to compute the first principal component efficiently is shown in the following pseudo-code, for a data matrix $X$ with zero mean, without ever computing its covariance matrix.

r = a random vector of length p
r = r / norm(r)
do c times:
      s = 0 (a vector of length p)
      for each row x in X
            s = s + (x ⋅ r) x
      λ = r^T s // λ is the eigenvalue
      error = |λ ⋅ r − s|
      r = s / norm(s)
      exit if error < tolerance
return λ, r

This power iteration algorithm simply calculates the vector $X^{T} (X r)$, normalizes, and places the result back in $r$. The eigenvalue is approximated by $r^{T} (X^{T} X) r$, which is the Rayleigh quotient on the unit vector $r$ for the covariance matrix $X^{T} X$. If the largest singular value is well separated from the next largest one, the vector $r$ gets close to the first principal component of $X$ within the number of iterations $c$, which is small relative to $p$, at the total cost $2cnp$. The power iteration convergence can be accelerated without noticeably sacrificing the small cost per iteration using more advanced matrix-free methods, such as the Lanczos algorithm or the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method.
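
The pseudo-code translates fairly directly into NumPy. The following is a sketch rather than a reference implementation: the function name, the default values of `c` and `tolerance`, and the test data are all assumptions, and the inner loop over rows is replaced by the equivalent product $X^{T} (X r)$:

```python
import numpy as np

def first_principal_component(X, c=1000, tolerance=1e-8, seed=0):
    """Power iteration on X^T X for zero-mean X, using only products with X."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=X.shape[1])
    r /= np.linalg.norm(r)
    lam = 0.0
    for _ in range(c):
        s = X.T @ (X @ r)            # same as summing (x . r) x over the rows x
        lam = r @ s                  # Rayleigh quotient r^T (X^T X) r
        error = np.linalg.norm(lam * r - s)
        r = s / np.linalg.norm(s)
        if error < tolerance:
            break
    return lam, r

# Hypothetical zero-mean data with one clearly dominant direction.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
lam, r = first_principal_component(X)
```

Note that `lam` is an eigenvalue of $X^{T} X$ as in the pseudo-code; dividing it by $n - 1$ would give the corresponding eigenvalue of the empirical covariance matrix.
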

Subsequent principal components can be computed one-by-one via deflation or simultaneously as a block. In the former approach, imprecisions in already computed approximate principal components additively affect the accuracy of the subsequently computed principal components, thus increasing the error with every new computation. The latter approach in the block power method replaces single-vectors $r$ and $s$ with block-vectors, matrices $R$ and $S$. Every column of $R$ approximates one of the leading principal components, while all columns are iterated simultaneously. The main calculation is evaluation of the product $X^{T} (X R)$. Implemented, for example, in LOBPCG, efficient blocking eliminates the accumulation of the errors, allows using high-level BLAS matrix-matrix product functions, and typically leads to faster convergence, compared to the single-vector one-by-one technique.
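
A simple block variant can be sketched with NumPy as below. This is a plain block power (subspace) iteration with QR orthonormalization and a final Rayleigh-Ritz step, chosen here for brevity; it is not LOBPCG, and the function name, iteration count, and data are assumptions:

```python
import numpy as np

def block_power_iteration(X, k, iters=200, seed=0):
    """Approximate the k leading principal directions of zero-mean X."""
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.normal(size=(X.shape[1], k)))   # orthonormal start
    for _ in range(iters):
        S = X.T @ (X @ R)            # block product X^T (X R)
        R, _ = np.linalg.qr(S)       # re-orthonormalize the columns
    # Rayleigh-Ritz step: rotate the block so that each column lines up with
    # an individual eigenvector instead of only spanning the right subspace.
    M = R.T @ (X.T @ (X @ R))
    w, Q = np.linalg.eigh(M)
    order = np.argsort(w)[::-1]
    return w[order], R @ Q[:, order]

# Hypothetical zero-mean data.
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
vals, R = block_power_iteration(X, k=2)
```
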

The NIPALS method

Non-linear iterative partial least squares (NIPALS) is a variant of the classical power iteration with matrix deflation by subtraction, implemented for computing the first few components in a principal component or partial least squares analysis. For very-high-dimensional datasets, such as those generated in the *omics sciences (for example, genomics, metabolomics), it is usually only necessary to compute the first few PCs. The NIPALS algorithm updates iterative approximations to the leading scores and loadings t₁ and r₁^T by the power iteration, multiplying on every iteration by X on the left and on the right; that is, calculation of the covariance matrix is avoided, just as in the matrix-free implementation of the power iterations to $X^{T} X$, based on the function evaluating the product $X^{T} (X r) = ((X r)^{T} X)^{T}$.

The matrix deflation by subtraction is performed by subtracting the outer product t₁r₁^T from X, leaving the deflated residual matrix used to calculate the subsequent leading PCs. For large data matrices, or matrices that have a high degree of column collinearity, NIPALS suffers from loss of orthogonality of PCs due to machine precision round-off errors accumulated in each iteration and matrix deflation by subtraction. A Gram–Schmidt re-orthogonalization algorithm is applied to both the scores and the loadings at each iteration step to eliminate this loss of orthogonality. The reliance of NIPALS on single-vector multiplications cannot take advantage of high-level BLAS and results in slow convergence for clustered leading singular values; both these deficiencies are resolved in more sophisticated matrix-free block solvers, such as the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method.
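
A bare-bones version of the iteration with deflation by subtraction might look like the sketch below. It assumes NumPy and column-centered data; the function name, starting vector, and stopping rule are illustrative choices, and the Gram–Schmidt re-orthogonalization mentioned above is deliberately left out to keep the example short:

```python
import numpy as np

def nipals(X, n_components, max_iter=500, tol=1e-12):
    """Sketch of NIPALS for PCA on column-centered X (deflation by subtraction)."""
    E = X.copy()                                   # residual matrix
    scores, loadings = [], []
    for _component in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()  # start from the highest-variance column
        for _ in range(max_iter):
            r = E.T @ t                            # multiply by the residual on one side
            r /= np.linalg.norm(r)                 # loading vector, unit length
            t_new = E @ r                          # multiply on the other side: score vector
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        E = E - np.outer(t, r)                     # deflate: subtract the outer product t r^T
        scores.append(t)
        loadings.append(r)
    return np.column_stack(scores), np.column_stack(loadings)

# Hypothetical zero-mean data.
rng = np.random.default_rng(9)
X = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
T, R = nipals(X, n_components=2)
```
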

Online/sequential estimation

In an "online" or "streaming" situation with data arriving piece by piece rather than being stored in a single batch, it is useful to make an estimate of the PCA projection that can be updated sequentially. This can be done efficiently, but requires different algorithms.
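
As one simple illustration (an assumption of this example, not necessarily the kind of algorithm meant above), the mean and covariance can be updated one observation at a time with a Welford-style recurrence, and the components recomputed from the running covariance whenever needed. The class name and data are hypothetical. The approach stores a $p \times p$ matrix, so it only suits moderate $p$; dedicated streaming methods such as Oja's rule avoid that cost:

```python
import numpy as np

class StreamingCovariance:
    """Running mean and covariance, updated one observation at a time."""

    def __init__(self, p):
        self.n = 0
        self.mean = np.zeros(p)
        self.M2 = np.zeros((p, p))    # running sum of deviation outer products

    def update(self, x):
        self.n += 1
        delta = x - self.mean         # deviation from the previous mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, x - self.mean)   # uses the updated mean

    def components(self, L):
        C = self.M2 / (self.n - 1)    # empirical covariance so far
        w, V = np.linalg.eigh(C)
        order = np.argsort(w)[::-1][:L]
        return w[order], V[:, order]

# Hypothetical stream of 500 three-dimensional observations.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3)) * np.array([4.0, 2.0, 1.0])
est = StreamingCovariance(p=3)
for x in X:
    est.update(x)
vals, W = est.components(L=2)
```
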

Qualitative variables

In PCA, it is common to introduce qualitative variables as supplementary elements. For example, suppose many quantitative variables have been measured on a set of plants, and some qualitative variables are also available for these plants, such as the species to which each plant belongs. The quantitative variables are subjected to PCA. When analyzing the results, it is natural to connect the principal components to the qualitative variable species. For this, the following results are produced:

- Identification, on the factorial planes, of the different species, for example, using different colors.
- Representation, on the factorial planes, of the centers of gravity of plants belonging to the same species.
- For each center of gravity and each axis, a p-value to judge the significance of the difference between the center of gravity and the origin.

These results are what is called introducing a qualitative variable as a supplementary element. This procedure is detailed in Husson, Lê, & Pagès (2009) and Pagès (2013). Few software packages offer this option in an "automatic" way. This is the case of SPAD, which historically, following the work of Ludovic Lebart, was the first to propose this option, and of the R package FactoMineR.
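
The centers of gravity can be computed by hand once the scores are available. The sketch below assumes NumPy and entirely synthetic plant data (species labels, group sizes, and offsets are invented). The species label plays no part in the PCA itself and is only used afterwards to group the scores; the p-values are not computed here:

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical data: 3 species, 20 plants each, 4 quantitative measurements.
species = np.repeat(np.array(["a", "b", "c"]), 20)
offsets = {"a": 0.0, "b": 2.0, "c": -2.0}
X = rng.normal(size=(60, 4)) + np.array([offsets[s] for s in species])[:, None]

B = X - X.mean(axis=0)                         # PCA uses quantitative variables only
_, _, Vt = np.linalg.svd(B, full_matrices=False)
T = B @ Vt[:2].T                               # scores on the first factorial plane

# Center of gravity of each species on that plane.
centroids = {str(s): T[species == s].mean(axis=0) for s in np.unique(species)}
```

Plotting `T` colored by `species`, with the entries of `centroids` overlaid, gives the first two displays listed above.
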

Applications

Intelligence

The earliest application of factor analysis was in locating and measuring components of human intelligence. It was believed that intelligence had various uncorrelated components such as spatial intelligence, verbal intelligence, induction, deduction, etc., and that scores on these could be adduced by factor analysis from results on various tests, to give a single index known as the Intelligence Quotient (IQ). The pioneering statistical psychologist Spearman actually developed factor analysis in 1904 for his two-factor theory of intelligence, adding a formal technique to the science of psychometrics. In 1924 Thurstone looked for 56 factors of intelligence, developing the notion of Mental Age. Standard IQ tests today are based on this early work.

Residential differentiation

In 1949, Shevky and Williams introduced the theory of factorial ecology, which dominated studies of residential differentiation from the 1950s to the 1970s. Neighbourhoods in a city were recognizable or could be distinguished from one another by various characteristics which could be reduced to three by factor analysis. These were known as 'social rank' (an index of occupational status), 'familism' or family size, and 'ethnicity'. Cluster analysis could then be applied to divide the city into clusters or precincts according to values of the three key factor variables. An extensive literature developed around factorial ecology in urban geography, but the approach went out of fashion after 1980 as being methodologically primitive and having little place in postmodern geographical paradigms.

One of the problems with factor analysis has always been finding convincing names for the various artificial factors. In 2000, Flood revived the factorial ecology approach to show that principal components analysis actually gave meaningful answers directly, without resorting to factor rotation. The principal components were actually dual variables or shadow prices of 'forces' pushing people together or apart in cities. The first component was 'accessibility', the classic trade-off between demand for travel and demand for space, around which classical urban economics is based. The next two components were 'disadvantage', which keeps people of similar status in separate neighbourhoods (mediated by planning), and ethnicity, where people of similar ethnic backgrounds try to co-locate.

About the same time, the Australian Bureau of Statistics defined distinct indexes of advantage and disadvantage taking the first principal component of sets of key variables that were thought to be important. These SEIFA indexes are regularly published for various jurisdictions, and are used frequently in spatial analysis.

Development indexes

PCA can be used as a formal method for the development of indexes. As an alternative, confirmatory composite analysis has been proposed to develop and assess indexes.

The City Development Index was developed by PCA from about 200 indicators of city outcomes in a 1996 survey of 254 global cities. The first principal component was subject to iterative regression, adding the original variables singly until about 90% of its variation was accounted for. The index ultimately used about 15 indicators but was a good predictor of many more variables. Its comparative value agreed very well with a subjective assessment of the condition of each city. The coefficients on items of infrastructure were roughly proportional to the average costs of providing the underlying services, suggesting the Index was actually a measure of effective physical and social investment in the city.

The country-level Human Development Index (HDI) from UNDP, which has been published since 1990 and is very extensively used in development studies, has very similar coefficients on similar indicators, strongly suggesting it was originally constructed using PCA.

Population genetics

In 1978 Cavalli-Sforza and others pioneered the use of principal components analysis (PCA) to summarise data on variation in human gene frequencies across regions. The components showed distinctive patterns, including gradients and sinusoidal waves. They interpreted these patterns as resulting from specific ancient migration events.

Since then, PCA has been ubiquitous in population genetics, with thousands of papers using PCA as a display mechanism. Genetics varies largely according to proximity, so the first two principal components actually show spatial distribution and may be used to map the relative geographical location of different population groups, thereby showing individuals who have wandered from their original locations.

PCA in genetics has been technically controversial, in that the technique has been performed on discrete non-normal variables and often on binary allele markers. The lack of any measures of standard error in PCA is also an impediment to more consistent usage. In August 2022, the molecular biologist Eran Elhaik published a theoretical paper in Scientific Reports analyzing 12 PCA applications. He concluded that it was easy to manipulate the method, which, in his view, generated results that were 'erroneous, contradictory, and absurd.' Specifically, he argued, the results achieved in population genetics were characterized by cherry-picking and circular reasoning.

Market research and indexes of attitude

Market research has been an extensive user of PCA. It is used to develop customer satisfaction or customer loyalty scores for products, and with clustering, to develop market segments that may be targeted with advertising campaigns, in much the same way as factorial ecology will locate geographical areas with similar characteristics.

PCA rapidly transforms large amounts of data into smaller, easier-to-digest variables that can be more rapidly and readily analyzed. In any consumer questionnaire, there are series of questions designed to elicit consumer attitudes, and principal components seek out latent variables underlying these attitudes. For example, the Oxford Internet Survey in 2013 asked 2000 people about their attitudes and beliefs, and from these analysts extracted four principal component dimensions, which they identified as 'escape', 'social networking', 'efficiency', and 'problem creating'.

Another example from Joe Flood in 2008 extracted an attitudinal index toward housing from 28 attitude questions in a national survey of 2697 households in Australia. The first principal component represented a general attitude toward property and home ownership. The index, or the attitude questions it embodied, could be fed into a General Linear Model of tenure choice. The strongest determinant of private renting by far was the attitude index, rather than income, marital status or household type.

Quantitative finance

In quantitative finance, PCA is used in financial risk management, and has been applied to other problems such as portfolio optimization.

PCA is commonly used in problems involving fixed income securities and portfolios, and interest rate derivatives. Valuations here depend on the entire yield curve, comprising numerous highly correlated instruments, and PCA is used to define a set of components or factors that explain rate movements, thereby facilitating the modelling. One common risk management application is calculating value at risk (VaR) by applying PCA to the Monte Carlo simulation. Here, for each simulation sample, the components are stressed, and rates, and in turn option values, are then reconstructed, with VaR finally calculated over the entire run. PCA is also used in hedging exposure to interest rate risk, given partial durations and other sensitivities. In both cases, typically the first three principal components of the system are of interest (representing "shift", "twist", and "curvature"). These principal components are derived from an eigen-decomposition of the covariance matrix of yield at predefined maturities, and the variance of each component is its eigenvalue (as the components are orthogonal, no correlation need be incorporated in subsequent modelling).

For equity, an optimal portfolio is one where the expected return is maximized for a given level of risk, or alternatively, where risk is minimized for a given return; see Markowitz model for discussion. Thus, one approach is to reduce portfolio risk, where allocation strategies are applied to the "principal portfolios" instead of the underlying stocks. A second approach is to enhance portfolio return, using the principal components to select companies' stocks with upside potential. PCA has also been used to understand relationships between international equity markets, and within markets between groups of companies in industries or sectors.

PCA may also be applied to stress testing, essentially an analysis of a bank's ability to endure a hypothetical adverse economic scenario. Its utility is in "distilling the information contained in [several] macroeconomic variables into a more manageable data set, which can then [be used] for analysis." Here, the resulting factors are linked to e.g. interest rates – based on the largest elements of the factor's eigenvector – and it is then observed how a "shock" to each of the factors affects the implied assets of each of the banks.

Neuroscience

A variant of principal components analysis is used in neuroscience to identify the specific properties of a stimulus that increases a neuron's probability of generating an action potential. This technique is known as spike-triggered covariance analysis. In a typical application an experimenter presents a white noise process as a stimulus (usually either as a sensory input to a test subject, or as a current injected directly into the neuron) and records a train of action potentials, or spikes, produced by the neuron as a result. Presumably, certain features of the stimulus make the neuron more likely to spike. In order to extract these features, the experimenter calculates the covariance matrix of the spike-triggered ensemble, the set of all stimuli (defined and discretized over a finite time window, typically on the order of 100 ms) that immediately preceded a spike. The eigenvectors of the difference between the spike-triggered covariance matrix and the covariance matrix of the prior stimulus ensemble (the set of all stimuli, defined over the same length time window) then indicate the directions in the space of stimuli along which the variance of the spike-triggered ensemble differed the most from that of the prior stimulus ensemble. Specifically, the eigenvectors with the largest positive eigenvalues correspond to the directions along which the variance of the spike-triggered ensemble showed the largest positive change compared to the variance of the prior. Since these were the directions in which varying the stimulus led to a spike, they are often good approximations of the sought after relevant stimulus features.

In neuroscience, PCA is also used to discern the identity of a neuron from the shape of its action potential. Spike sorting is an important procedure because extracellular recording techniques often pick up signals from more than one neuron. In spike sorting, one first uses PCA to reduce the dimensionality of the space of action potential waveforms, and then performs clustering analysis to associate specific action potentials with individual neurons.

PCA as a dimension reduction technique is particularly suited to detect coordinated activities of large neuronal ensembles. It has been used in determining collective variables, that is, order parameters, during phase transitions in the brain.

Relation with other methods

Correspondence analysis

Correspondence analysis (CA) was developed by Jean-Paul Benzécri and is conceptually similar to PCA, but scales the data (which should be non-negative) so that rows and columns are treated equivalently. It is traditionally applied to contingency tables. CA decomposes the chi-squared statistic associated to this table into orthogonal factors. Because CA is a descriptive technique, it can be applied to tables for which the chi-squared statistic is appropriate or not. Several variants of CA are available including detrended correspondence analysis and canonical correspondence analysis. One special extension is multiple correspondence analysis, which may be seen as the counterpart of principal component analysis for categorical data.

Factor analysis

Principal component analysis creates variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The PCA transformation can be helpful as a pre-processing step before clustering. PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. PCA is generally preferred for purposes of data reduction (that is, translating variable space into optimal factor space) but not when the goal is to detect the latent construct or factors.

Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables. Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors "represent the common variance of variables, excluding unique variance". In terms of the correlation matrix, this corresponds with focusing on explaining the off-diagonal terms (that is, shared co-variance), while PCA focuses on explaining the terms that sit on the diagonal. However, as a side result, when trying to reproduce the on-diagonal terms, PCA also tends to fit relatively well the off-diagonal correlations. Results given by PCA and factor analysis are very similar in most situations, but this is not always the case, and there are some problems where the results are significantly different. Factor analysis is generally used when the research purpose is detecting data structure (that is, latent constructs or factors) or causal modeling. If the factor model is incorrectly formulated or the assumptions are not met, then factor analysis will give erroneous results.

$K$ -means clustering

It has been asserted that the relaxed solution of $k$ -means clustering, specified by the cluster indicators, is given by the principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace. However, that PCA is a useful relaxation of $k$ -means clustering was not a new result, and it is straightforward to uncover counterexamples to the statement that the cluster centroid subspace is spanned by the principal directions.

Non-negative matrix factorization

Non-negative matrix factorization (NMF) is a dimension reduction method where only non-negative elements in the matrices are used, which is therefore a promising method in astronomy, in the sense that astrophysical signals are non-negative. The PCA components are orthogonal to each other, while the NMF components are all non-negative and therefore construct a non-orthogonal basis.

In PCA, the contribution of each component is ranked based on the magnitude of its corresponding eigenvalue, which is equivalent to the fractional residual variance (FRV) in analyzing empirical data. For NMF, its components are ranked based only on the empirical FRV curves. The residual fractional eigenvalue plots, that is, $1 - \sum_{i = 1}^{k} λ_{i} / \sum_{j = 1}^{n} λ_{j}$ as a function of component number $k$ given a total of $n$ components, for PCA have a flat plateau, where no data is captured to remove the quasi-static noise; the curves then drop quickly as an indication of over-fitting (random noise). The FRV curves for NMF decrease continuously when the NMF components are constructed sequentially, indicating the continuous capturing of quasi-static noise; they then converge to higher levels than PCA, indicating that NMF is less prone to over-fitting.
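
For PCA, the residual fractional eigenvalue curve follows directly from the eigenvalues. A small sketch, assuming NumPy; the function name and the example eigenvalues are made up:

```python
import numpy as np

def fractional_residual_variance(eigvals):
    """Return 1 - sum_{i<=k} lambda_i / sum_j lambda_j for k = 1, ..., n."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # decreasing order
    return 1.0 - np.cumsum(eigvals) / eigvals.sum()

frv = fractional_residual_variance([4.0, 2.0, 1.0, 1.0])
# frv is [0.5, 0.25, 0.125, 0.0]: the variance left after keeping k components
```
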

Iconography of correlations

It is often difficult to interpret the principal components when the data include many variables of various origins, or when some variables are qualitative. This leads the PCA user to a delicate elimination of several variables. If observations or variables have an excessive impact on the direction of the axes, they should be removed and then projected as supplementary elements. In addition, it is necessary to avoid interpreting the proximities between the points close to the center of the factorial plane.

The iconography of correlations, on the contrary, which is not a projection on a system of axes, does not have these drawbacks. We can therefore keep all the variables.

The principle of the diagram is to underline the "remarkable" correlations of the correlation matrix, by a solid line (positive correlation) or dotted line (negative correlation).

A strong correlation is not "remarkable" if it is not direct, but caused by the effect of a third variable. Conversely, weak correlations can be "remarkable". For example, if a variable Y depends on several independent variables, the correlations of Y with each of them are weak and yet "remarkable".

Generalizations

Sparse PCA

A particular disadvantage of PCA is that the principal components are usually linear combinations of all input variables. Sparse PCA overcomes this disadvantage by finding linear combinations that contain just a few input variables. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by adding a sparsity constraint on the input variables. Several approaches have been proposed, including:

- a regression framework,
- a convex relaxation/semidefinite programming framework,
- a generalized power method framework,
- an alternating maximization framework,
- forward-backward greedy search and exact methods using branch-and-bound techniques,
- a Bayesian formulation framework.

The methodological and theoretical developments of Sparse PCA as well as its applications in scientific studies were recently reviewed in a survey paper.

Nonlinear PCA

Most of the modern methods for nonlinear dimensionality reduction find their theoretical and algorithmic roots in PCA or K-means. Pearson's original idea was to take a straight line (or plane) which will be "the best fit" to a set of data points. Trevor Hastie expanded on this concept by proposing Principal curves as the natural extension for the geometric interpretation of PCA, which explicitly constructs a manifold for data approximation followed by projecting the points onto it. See also the elastic map algorithm and principal geodesic analysis. Another popular generalization is kernel PCA, which corresponds to PCA performed in a reproducing kernel Hilbert space associated with a positive definite kernel.

In multilinear subspace learning,^[89]^[90]^[91] PCA is generalized to multilinear PCA (MPCA) that extracts features directly from tensor representations. MPCA is solved by performing PCA in each mode of the tensor iteratively. MPCA has been applied to face recognition, gait recognition, etc. MPCA is further extended to uncorrelated MPCA, non-negative MPCA and robust MPCA.

N-way principal component analysis may be performed with models such as Tucker decomposition, PARAFAC, multiple factor analysis, co-inertia analysis, STATIS, and DISTATIS.

Robust PCA

While PCA finds the mathematically optimal method (as in minimizing the squared error), it is still sensitive to outliers in the data that produce large errors, something that the method tries to avoid in the first place. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example, in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a weighted PCA increases robustness by assigning different weights to data objects based on their estimated relevancy.

Outlier-resistant variants of PCA have also been proposed, based on L1-norm formulations (L1-PCA).

Robust principal component analysis (RPCA) via decomposition in low-rank and sparse matrices is a modification of PCA that works well with respect to grossly corrupted observations.

Similar techniques

Independent component analysis

Independent component analysis (ICA) is directed to similar problems as principal component analysis, but finds additively separable components rather than successive approximations.

Network component analysis

Given a matrix $E$, network component analysis tries to decompose it into two matrices such that $E = A P$. A key difference from techniques such as PCA and ICA is that some of the entries of $A$ are constrained to be 0. Here $P$ is termed the regulatory layer. While in general such a decomposition can have multiple solutions, it has been proven that if the following conditions are satisfied:

- $A$ has full column rank.
- Each column of $A$ must have at least $L - 1$ zeroes, where $L$ is the number of columns of $A$ (or alternatively the number of rows of $P$). The justification for this criterion is that if a node is removed from the regulatory layer along with all the output nodes connected to it, the result must still be characterized by a connectivity matrix with full column rank.
- $P$ must have full row rank.

then the decomposition is unique up to multiplication by a scalar.

Discriminant analysis of principal components

Discriminant analysis of principal components (DAPC) is a multivariate method used to identify and describe clusters of genetically related individuals. Genetic variation is partitioned into two components: variation between groups and within groups, and it maximizes the former. Linear discriminants are linear combinations of alleles which best separate the clusters. Alleles that most contribute to this discrimination are therefore those that are the most markedly different across groups. The contributions of alleles to the groupings identified by DAPC can allow identifying regions of the genome driving the genetic divergence among groups. In DAPC, data is first transformed using a principal components analysis (PCA) and subsequently clusters are identified using discriminant analysis (DA).

DAPC can be performed in R using the adegenet package.

Directional component analysis

Directional component analysis (DCA) is a method used in the atmospheric sciences for analysing multivariate datasets. Like PCA, it allows for dimension reduction, improved visualization and improved interpretability of large data-sets. Also like PCA, it is based on a covariance matrix derived from the input dataset. The difference between PCA and DCA is that DCA additionally requires the input of a vector direction, referred to as the impact. Whereas PCA maximises explained variance, DCA maximises probability density given impact. The motivation for DCA is to find components of a multivariate dataset that are both likely (measured using probability density) and important (measured using the impact). DCA has been used to find the most likely and most serious heat-wave patterns in weather prediction ensembles, and the most likely and most impactful changes in rainfall due to climate change.

Software/source code

ALGLIB – a C++ and C# library that implements PCA and truncated PCA
Analytica – The built-in EigenDecomp function computes principal components.
ELKI – includes PCA for projection, including robust variants of PCA, as well as PCA-based clustering algorithms.
Gretl – principal component analysis can be performed either via the pca command or via the princomp() function.
Julia – Supports PCA with the pca function in the MultivariateStats package
KNIME – Java-based node-arranging software for analysis; its nodes PCA, PCA compute, PCA Apply and PCA inverse make the analysis straightforward.
Maple (software) – The PCA command is used to perform a principal component analysis on a set of data.
Mathematica – Implements principal component analysis with the PrincipalComponents command using both covariance and correlation methods.
MathPHP – PHP mathematics library with support for PCA.
MATLAB – The SVD function is part of the basic system. In the Statistics Toolbox, the functions princomp and pca (R2012b) give the principal components, while the function pcares gives the residuals and reconstructed matrix for a low-rank PCA approximation.
Matplotlib – Python library that has a PCA package in the .mlab module.
mlpack – Provides an implementation of principal component analysis in C++.
mrmath – A high performance math library for Delphi and FreePascal can perform PCA; including robust variants.
NAG Library – Principal components analysis is implemented via the g03aa routine (available in both the Fortran versions of the Library).
NMath – Proprietary numerical library containing PCA for the .NET Framework.
GNU Octave – Free software computational environment mostly compatible with MATLAB, the function princomp gives the principal component.
OpenCV
Oracle Database 12c – Implemented via DBMS_DATA_MINING.SVDS_SCORING_MODE by specifying setting value SVDS_SCORING_PCA
Orange (software) – Integrates PCA in its visual programming environment. PCA displays a scree plot (degree of explained variance) where user can interactively select the number of principal components.
Origin – Contains PCA in its Pro version.
Qlucore – Commercial software for analyzing multivariate data with instant response using PCA.
R – Free statistical package, the functions princomp and prcomp can be used for principal component analysis; prcomp uses singular value decomposition which generally gives better numerical accuracy. Some packages that implement PCA in R, include, but are not limited to: ade4, vegan, ExPosition, dimRed, and FactoMineR.
SAS – Proprietary software.
scikit-learn – Python library for machine learning which contains PCA, Probabilistic PCA, Kernel PCA, Sparse PCA and other techniques in the decomposition module.
Scilab – Free and open-source, cross-platform numerical computational package, the function princomp computes principal component analysis, the function pca computes principal component analysis with standardized variables.
SPSS – Proprietary software most commonly used by social scientists for PCA, factor analysis and associated cluster analysis.
Weka – Java library for machine learning which contains modules for computing principal components.
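Several of the entries above mention computing PCA through a singular value decomposition rather than through the covariance matrix. The following is a minimal NumPy sketch of both routes on the same data; it is not the implementation used by any of the listed packages, and the function names are hypothetical.

```python
import numpy as np

def pca_svd(X):
    """PCA via SVD of the centered data matrix (rows = observations)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)  # variance along each component
    return Vt, variances                 # rows of Vt are the components

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix."""
    C = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # reorder to descending
    return eigvecs[:, order].T, eigvals[order]
```

The two routes should agree on the variances, and on the components up to a sign flip of each vector, since the sign of a principal component is arbitrary.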

Mutagenesis

From Wikipedia, the free encyclopedia

Mutagenesis (/mjuːtəˈdʒɛnɪsɪs/) is a process by which the genetic information of an organism is changed by the production of a mutation. It may occur spontaneously in nature, or as a result of exposure to mutagens. It can also be achieved experimentally using laboratory procedures. A mutagen is a mutation-causing agent, be it chemical or physical, which results in an increased rate of mutations in an organism's genetic code. In nature mutagenesis can lead to cancer and various heritable diseases, and it is also a driving force of evolution. Mutagenesis as a science was developed based on work done by Hermann Muller, Charlotte Auerbach and J. M. Robson in the first half of the 20th century.

History

DNA may be modified, either naturally or artificially, by a number of physical, chemical and biological agents, resulting in mutations. Hermann Muller found that "high temperatures" have the ability to mutate genes in the early 1920s, and in 1927, demonstrated a causal link to mutation upon experimenting with an X-ray machine, noting phylogenetic changes when irradiating fruit flies with a relatively high dose of X-rays. Muller observed a number of chromosome rearrangements in his experiments, and suggested mutation as a cause of cancer. The association of exposure to radiation and cancer had been observed as early as 1902, six years after the discovery of X-rays by Wilhelm Röntgen, and the discovery of radioactivity by Henri Becquerel. Lewis Stadler, Muller's contemporary, also showed the effect of X-rays on mutations in barley in 1928, and of ultraviolet (UV) radiation on maize in 1936. In the 1940s, Charlotte Auerbach and J. M. Robson found that mustard gas can also cause mutations in fruit flies.

While changes to the chromosome caused by X-ray and mustard gas were readily observable to early researchers, other changes to the DNA induced by other mutagens were not so easily observable; the mechanism by which they occur may be complex, and take longer to unravel. For example, soot was suggested to be a cause of cancer as early as 1775, and coal tar was demonstrated to cause cancer in 1915. The chemicals involved in both were later shown to be polycyclic aromatic hydrocarbons (PAH). PAHs by themselves are not carcinogenic, and it was proposed in 1950 that the carcinogenic forms of PAHs are the oxides produced as metabolites from cellular processes. The metabolic process was identified in the 1960s as catalysis by cytochrome P450, which produces reactive species that can interact with the DNA to form adducts, or product molecules resulting from the reaction of DNA and, in this case, the reactive metabolites; the mechanism by which the PAH adducts give rise to mutation, however, is still under investigation.

Distinction between a mutation and DNA damage

DNA damage is an abnormal alteration in the structure of DNA that cannot, itself, be replicated when DNA replicates. In contrast, a mutation is a change in the nucleic acid sequence that can be replicated; hence, a mutation can be inherited from one generation to the next. Damage can occur from chemical addition (adduct), or structural disruption to a base of DNA (creating an abnormal nucleotide or nucleotide fragment), or a break in one or both DNA strands. Such DNA damage may result in mutation. When DNA containing damage is replicated, an incorrect base may be inserted in the new complementary strand as it is being synthesized (see DNA repair § Translesion synthesis). The incorrect insertion in the new strand will occur opposite the damaged site in the template strand, and this incorrect insertion can become a mutation (i.e. a changed base pair) in the next round of replication. Furthermore, double-strand breaks in DNA may be repaired by an inaccurate repair process, non-homologous end joining, which produces mutations. Mutations can ordinarily be avoided if accurate DNA repair systems recognize DNA damage and repair it prior to completion of the next round of replication. At least 169 enzymes are either directly employed in DNA repair or influence DNA repair processes. Of these, 83 are directly employed in the 5 types of DNA repair processes indicated in the chart shown in the article DNA repair.

Mammalian nuclear DNA may sustain more than 60,000 damage episodes per cell per day, as listed with references in DNA damage (naturally occurring). If left uncorrected, these adducts, after misreplication past the damaged sites, can give rise to mutations. In nature, the mutations that arise may be beneficial or deleterious; this is the driving force of evolution. An organism may acquire new traits through genetic mutation, but mutation may also result in impaired function of the genes and, in severe cases, cause the death of the organism. Mutation is also a major source for acquisition of resistance to antibiotics in bacteria, and to antifungal agents in yeasts and molds. In a laboratory setting, mutagenesis is a useful technique for generating mutations that allows the functions of genes and gene products to be examined in detail, producing proteins with improved characteristics or novel functions, as well as mutant strains with useful properties. Initially, the ability of radiation and chemical mutagens to cause mutation was exploited to generate random mutations, but later techniques were developed to introduce specific mutations.

In humans, an average of 60 new mutations are transmitted from parent to offspring. Human males, however, tend to pass on more mutations depending on their age, transmitting an average of two new mutations to their progeny with every additional year of their age.

Mechanisms

Mutagenesis may occur endogenously (e.g. spontaneous hydrolysis), through normal cellular processes that can generate reactive oxygen species and DNA adducts, or through error in DNA replication and repair. Mutagenesis may also occur as a result of the presence of environmental mutagens that induce changes to an organism's DNA. The mechanism by which mutation occurs varies according to the mutagen, or the causative agent, involved. Most mutagens act either directly, or indirectly via mutagenic metabolites, on an organism's DNA, producing lesions. Some mutagens, however, may affect the replication or chromosomal partition mechanism, and other cellular processes.

Mutagenesis may also be self-induced by unicellular organisms when environmental conditions are restrictive to the organism's growth, such as bacteria growing in the presence of antibiotics, yeast growing in the presence of an antifungal agent, or other unicellular organisms growing in an environment lacking in an essential nutrient.

Many chemical mutagens require biological activation to become mutagenic. An important group of enzymes involved in the generation of mutagenic metabolites is cytochrome P450. Other enzymes that may also produce mutagenic metabolites include glutathione S-transferase and microsomal epoxide hydrolase. Mutagens that are not mutagenic by themselves but require biological activation are called promutagens.

While most mutagens produce effects that ultimately result in errors in replication, for example creating adducts that interfere with replication, some mutagens may directly affect the replication process or reduce its fidelity. Base analogs such as 5-bromouracil may substitute for thymine in replication. Metals such as cadmium, chromium, and nickel can increase mutagenesis in a number of ways in addition to direct DNA damage, for example reducing the ability to repair errors, as well as producing epigenetic changes.

Mutations often arise as a result of problems caused by DNA lesions during replication, resulting in errors in replication. In bacteria, extensive damage to DNA due to mutagens results in single-stranded DNA gaps during replication. This induces the SOS response, an emergency repair process that is also error-prone, thereby generating mutations. In mammalian cells, stalling of replication at damaged sites induces a number of rescue mechanisms that help bypass DNA lesions; however, this may also result in errors. The Y family of DNA polymerases specializes in DNA lesion bypass in a process termed translesion synthesis (TLS) whereby these lesion-bypass polymerases replace the stalled high-fidelity replicative DNA polymerase, transit the lesion and extend the DNA until the lesion has been passed so that normal replication can resume; these processes may be error-prone or error-free.

DNA damage and spontaneous mutation

The number of DNA damage episodes occurring in a mammalian cell per day is high (more than 60,000 per day). Frequent occurrence of DNA damage is likely a problem for all DNA-containing organisms, and the need to cope with DNA damage and minimize its deleterious effects is likely a fundamental problem for life.

Most spontaneous mutations likely arise from error-prone trans-lesion synthesis past a DNA damage site in the template strand during DNA replication. This process can overcome potentially lethal blockages, but at the cost of introducing inaccuracies in daughter DNA. The causal relationship of DNA damage to spontaneous mutation is illustrated by aerobically growing E. coli bacteria, in which 89% of spontaneously occurring base substitution mutations are caused by reactive oxygen species (ROS)-induced DNA damage. In yeast, more than 60% of spontaneous single-base pair substitutions and deletions are likely caused by trans-lesion synthesis.

An additional significant source of mutations in eukaryotes is the inaccurate DNA repair process non-homologous end joining, that is often employed in repair of double strand breaks.

In general, it appears that the main underlying cause of spontaneous mutation is error-prone trans-lesion synthesis during DNA replication and that the error-prone non-homologous end-joining repair pathway may also be an important contributor in eukaryotes.

Spontaneous hydrolysis

DNA is not entirely stable in aqueous solution, and depurination of the DNA can occur. Under physiological conditions the glycosidic bond may be hydrolyzed spontaneously and 10,000 purine sites in DNA are estimated to be depurinated each day in a cell. Numerous DNA repair pathways exist for DNA; however, if the apurinic site is not repaired, misincorporation of nucleotides may occur during replication. Adenine is preferentially incorporated by DNA polymerases in an apurinic site.

Cytidine may also become deaminated to uridine at one five-hundredth of the rate of depurination and can result in G to A transition. Eukaryotic cells also contain 5-methylcytosine, thought to be involved in the control of gene transcription, which can become deaminated into thymine.
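The two rates quoted above can be combined into a rough count. This is only a back-of-the-envelope sketch, and it assumes that "one five-hundredth of the rate" can be applied directly to the estimate of 10,000 depurinations per cell per day; the variable names are hypothetical.

```python
# Figures taken from the text above.
depurinations_per_cell_per_day = 10_000
deamination_to_depurination_ratio = 500  # deamination is 500 times slower

# Assumption: the ratio applies directly to the per-cell, per-day count.
deaminations_per_cell_per_day = (
    depurinations_per_cell_per_day / deamination_to_depurination_ratio
)
```

Under that assumption, the estimate is on the order of 20 cytidine deamination events per cell per day.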

Tautomerism

Tautomerization is the process by which compounds spontaneously rearrange themselves to assume their structural isomer forms. For example, the keto (C=O) forms of guanine and thymine can rearrange into their rare enol (-OH) forms, while the amino (-NH₂) forms of adenine and cytosine can result in the rarer imino (=NH) forms. In DNA replication, tautomerization alters the base-pairing sites and can cause the improper pairing of nucleic acid bases.

Modification of bases

Bases may be modified endogenously by normal cellular molecules. For example, DNA may be methylated by S-adenosylmethionine, thus altering the expression of the marked gene without incurring a mutation to the DNA sequence itself. Histone modification is a related process in which the histone proteins around which DNA coils can be similarly modified via methylation, phosphorylation, or acetylation; these modifications may act to alter gene expression of the local DNA, and may also act to denote locations of damaged DNA in need of repair. DNA may also be glycosylated by reducing sugars.

Many compounds, such as PAHs, aromatic amines, aflatoxin and pyrrolizidine alkaloids, may form reactive oxygen species catalyzed by cytochrome P450. These metabolites form adducts with the DNA, which can cause errors in replication, and the bulky aromatic adducts may form stable intercalation between bases and block replication. The adducts may also induce conformational changes in the DNA. Some adducts may also result in the depurination of the DNA; it is, however, uncertain how significant such depurination as caused by the adducts is in generating mutation.

Alkylation and arylation of bases can cause errors in replication. Some alkylating agents such as N-Nitrosamines may require the catalytic reaction of cytochrome-P450 for the formation of a reactive alkyl cation. N⁷ and O⁶ of guanine and the N³ and N⁷ of adenine are most susceptible to attack. N⁷-guanine adducts form the bulk of DNA adducts, but they appear to be non-mutagenic. Alkylation at O⁶ of guanine, however, is harmful because excision repair of O⁶-adduct of guanine may be poor in some tissues such as the brain. The O⁶ methylation of guanine can result in G to A transition, while O⁴-methylthymine can be mispaired with guanine. The type of the mutation generated, however, may be dependent on the size and type of the adduct as well as the DNA sequence.
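How a single O⁶-methylated guanine can become a fixed G to A transition over two rounds of replication can be sketched in a few lines. This is a deliberately simplified toy model, not a simulation of real replication: strands are copied position by position (strand orientation is ignored), the symbol "m" for O⁶-methylguanine is a hypothetical notation, and the damaged base is assumed to always pair with thymine.

```python
# Watson-Crick pairing for undamaged bases.
NORMAL_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Assumption: "m" (O6-methylguanine, hypothetical symbol) mispairs with T.
DAMAGED_PAIR = {"m": "T"}

def copy_strand(template):
    """Return the strand synthesized opposite the template, base by base."""
    rules = {**NORMAL_PAIR, **DAMAGED_PAIR}
    return "".join(rules[base] for base in template)

original = "ACGT"
damaged = "ACmT"  # the G at index 2 has been methylated at O6

first_round = copy_strand(damaged)       # "TGTA": T inserted opposite the lesion
second_round = copy_strand(first_round)  # "ACAT": the G has become an A
```

Copying the undamaged strand twice returns the original sequence, whereas the damaged strand yields A in place of G at the lesion site, which is the G to A transition described above.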

Ionizing radiation and reactive oxygen species often oxidize guanine to produce 8-oxoguanine.

Arrows indicate chromosomal breakages due to DNA damage.

Backbone damage

Ionizing radiation may produce highly reactive free radicals that can break the bonds in the DNA. Double-stranded breakages are especially damaging and hard to repair, producing translocation and deletion of part of a chromosome. Alkylating agents like mustard gas may also cause breakages in the DNA backbone. Oxidative stress may also generate highly reactive oxygen species that can damage DNA. Incorrect repair of other damage induced by the highly reactive species can also lead to mutations.

Crosslinking

The formation of covalent bonds between the bases of nucleotides in DNA, be they in the same strand or opposing strands, is referred to as crosslinking of DNA; crosslinking of DNA may affect both the replication and the transcription of DNA, and it may be caused by exposure to a variety of agents. Some naturally occurring chemicals may also promote crosslinking, such as psoralens after activation by UV radiation, and nitrous acid. Interstrand cross-linking (between two strands) causes more damage, as it blocks replication and transcription and can cause chromosomal breakages and rearrangements. Some crosslinkers such as cyclophosphamide, mitomycin C and cisplatin are used as anticancer chemotherapeutics because of their high degree of toxicity to proliferating cells.

Dimerization

Dimerization consists of the bonding of two monomers to form an oligomer, such as the formation of pyrimidine dimers as a result of exposure to UV radiation, which promotes the formation of a cyclobutyl ring between adjacent thymines in DNA. In human skin cells, thousands of dimers may be formed in a day due to normal exposure to sunlight. DNA polymerase η may help bypass these lesions in an error-free manner; however, individuals with defective DNA repair function, such as those with xeroderma pigmentosum, are sensitive to sunlight and may be prone to skin cancer.

Clinically, whether a tumor has formed as a direct consequence of UV radiation is discernible via DNA sequencing analysis for the characteristic context-specific dimerization pattern that occurs due to excessive exposure to sunlight.

Intercalation between bases

The planar structure of chemicals such as ethidium bromide and proflavine allows them to insert between bases in DNA. This insert causes the DNA's backbone to stretch and makes slippage in DNA during replication more likely to occur since the bonding between the strands is made less stable by the stretching. Forward slippage will result in deletion mutation, while reverse slippage will result in an insertion mutation. Also, the intercalation into DNA of anthracyclines such as daunorubicin and doxorubicin interferes with the functioning of the enzyme topoisomerase II, blocking replication as well as causing mitotic homologous recombination.

Insertional mutagenesis

Transposons and viruses or retrotransposons may insert DNA sequences into coding regions or functional elements of a gene and result in inactivation of the gene.

Adaptive mutagenesis mechanisms

Adaptive mutagenesis has been defined as mutagenesis mechanisms that enable an organism to adapt to an environmental stress. Since the variety of environmental stresses is very broad, the mechanisms that enable it are also quite broad, as far as research in the field has shown. For instance, in bacteria, modulation of the SOS response and endogenous prophage DNA synthesis has been shown to increase Acinetobacter baumannii resistance to ciprofloxacin, while resistance mechanisms are presumed to be linked to chromosomal mutation untransferable via horizontal gene transfer in some members of the family Enterobacteriaceae, such as E. coli, Salmonella spp., Klebsiella spp., and Enterobacter spp. Chromosomal events, especially gene amplification, also seem to be relevant to this adaptive mutagenesis in bacteria.

Research in eukaryotic cells is much scarcer, but chromosomal events seem also to be rather relevant: while an ectopic intrachromosomal recombination has been reported to be involved in acquisition of resistance to 5-fluorocytosine in Saccharomyces cerevisiae, genome duplications have been found to confer resistance in S. cerevisiae to nutrient-poor environments.

Laboratory applications

In the laboratory, mutagenesis is a technique by which DNA mutations are deliberately engineered to produce mutant genes, proteins, or strains of organisms. Various constituents of a gene, such as its control elements and its gene product, may be mutated so that the function of a gene or protein can be examined in detail. The mutation may also produce mutant proteins with altered properties, or enhanced or novel functions that may prove to be of use commercially. Mutant strains of organisms that have practical applications, or allow the molecular basis of particular cell function to be investigated, may also be produced.

Early methods of mutagenesis produced entirely random mutations; however, modern methods of mutagenesis are capable of producing site-specific mutations. Modern laboratory techniques used to generate these mutations include:

Regularization (physics)

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Regularization_(physics)

In physics, especially quantum field theory, regularization is a method of modifying observables which have singularities in order to make them finite by the introduction of a suitable parameter called the regulator. The regulator, also known as a "cutoff", models our lack of knowledge about physics at unobserved scales (e.g. scales of small size or large energy levels). It relies on a separation of scales: it compensates for (and requires) the possibility that "new physics" may be discovered at those scales, which the present theory is unable to model, while enabling the current theory to give accurate predictions as an "effective theory" within its intended scale of use.

It is distinct from renormalization, another technique to control infinities without assuming new physics, by adjusting for self-interaction feedback.

Regularization was for many decades controversial even amongst its inventors, as it combines physical and epistemological claims into the same equations. However, it is now well understood and has proven to yield useful, accurate predictions.

Overview

Regularization procedures deal with infinite, divergent, and nonsensical expressions by introducing an auxiliary concept of a regulator (for example, the minimal distance $ϵ$ in space, which is useful when the divergences arise from short-distance physical effects). The correct physical result is obtained in the limit in which the regulator goes away (in our example, $ϵ \to 0$), but the virtue of the regulator is that for its finite value, the result is finite.

However, the result usually includes terms proportional to expressions like $1 / ϵ$ which are not well-defined in the limit $ϵ \to 0$. Regularization is the first step towards obtaining a completely finite and meaningful result; in quantum field theory it must usually be followed by a related, but independent, technique called renormalization. Renormalization is based on the requirement that some physical quantities, expressed by seemingly divergent expressions such as $1 / ϵ$, are equal to the observed values. Such a constraint allows one to calculate a finite value for many other quantities that looked divergent.
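A standard classroom illustration of this two-step logic, which is not taken from the text above, is the electrostatic potential of an infinitely long line of charge. The sketch below regulates the problem with a finite half-length L (playing the role of the regulator, with 1/L going to zero), and then fixes the potential to vanish at a reference distance r0, which stands in for the "observed value". Units are chosen so that λ/(4πε₀) = 1; the function names and parameters are hypothetical.

```python
import math

def potential(r, L):
    """Potential at distance r from the midpoint of a line of half-length L.

    Closed form of the integral of dz / sqrt(z^2 + r^2) from -L to L,
    written as 2*ln((sqrt(L^2 + r^2) + L) / r) to avoid cancellation.
    It grows like 2*ln(2L/r) and diverges as the regulator L is removed.
    """
    s = math.hypot(L, r)
    return 2.0 * math.log((s + L) / r)

def renormalized_potential(r, r0, L):
    """Potential measured relative to the reference distance r0."""
    return potential(r, L) - potential(r0, L)
```

The regulated potential keeps growing as L increases, while the difference between two points settles to 2 ln(r0/r) and no longer depends on the regulator.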

The existence of a limit as $ϵ$ goes to zero and the independence of the final result from the regulator are nontrivial facts. The underlying reason for them lies in universality, as shown by Kenneth Wilson and Leo Kadanoff, and the existence of a second-order phase transition. Sometimes, taking the limit as $ϵ$ goes to zero is not possible. This is the case when we have a Landau pole and for nonrenormalizable couplings like the Fermi interaction. However, even for these two examples, if the regulator only gives reasonable results for $ϵ ≫ ℏ c / Λ$ (where $Λ$ is an upper energy cutoff) and we are working with scales of the order of $ℏ c / Λ^{'}$, regulators with $ℏ c / Λ ≪ ϵ ≪ ℏ c / Λ^{'}$ still give quite accurate approximations. The physical reason why we cannot take the limit of $ϵ$ going to zero is the existence of new physics at length scales below $ℏ c / Λ$.

It is not always possible to define a regularization such that the limit of ε going to zero is independent of the regularization. In this case, one says that the theory contains an anomaly. Anomalous theories have been studied in great detail and are often founded on the celebrated Atiyah–Singer index theorem or variations thereof (see, for example, the chiral anomaly).

Classical physics example

The problem of infinities first arose in the classical electrodynamics of point particles in the 19th and early 20th century.

The mass of a charged particle should include the mass–energy in its electrostatic field (electromagnetic mass). Assume that the particle is a charged spherical shell of radius $r_{e}$. The mass–energy in the field is

$$m_{em} = \int \frac{1}{2} E^{2} \, dV = \int_{r_{e}}^{\infty} \frac{1}{2} \left( \frac{q}{4 π r^{2}} \right)^{2} 4 π r^{2} \, dr = \frac{q^{2}}{8 π r_{e}},$$

which becomes infinite as $r_{e} \to 0$. This implies that the point particle would have infinite inertia, making it unable to be accelerated. Incidentally, the value of $r_{e}$ that makes $m_{em}$ equal to the electron mass is called the classical electron radius, which (setting $q = e$ and restoring factors of $c$ and $ε_{0}$) turns out to be

$$r_{e} = \frac{e^{2}}{4 π ε_{0} m_{e} c^{2}} = α \frac{ℏ}{m_{e} c} \approx 2.8 \times 10^{-15} \ \mathrm{m},$$

where $α \approx 1 / 137.036$ is the fine-structure constant, and $ℏ / m_{e} c$ is the reduced Compton wavelength of the electron.
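The two expressions for the classical electron radius can be checked against each other numerically. The sketch below uses CODATA 2018 values in SI units (hard-coded here as an assumption rather than taken from the text) and standard-library Python only; the variable names are hypothetical.

```python
import math

# CODATA 2018 recommended values, SI units.
E_CHARGE = 1.602176634e-19     # elementary charge, C (exact)
EPSILON_0 = 8.8541878128e-12   # vacuum permittivity, F/m
M_ELECTRON = 9.1093837015e-31  # electron mass, kg
C_LIGHT = 299792458.0          # speed of light, m/s (exact)
HBAR = 1.054571817e-34         # reduced Planck constant, J s

# Fine-structure constant: alpha = e^2 / (4 pi eps0 hbar c).
alpha = E_CHARGE**2 / (4 * math.pi * EPSILON_0 * HBAR * C_LIGHT)

# Classical electron radius, computed both ways.
r_e_direct = E_CHARGE**2 / (4 * math.pi * EPSILON_0 * M_ELECTRON * C_LIGHT**2)
r_e_via_alpha = alpha * HBAR / (M_ELECTRON * C_LIGHT)
```

Both routes give a radius of about 2.8 × 10⁻¹⁵ m, as they must, since the second expression is an algebraic rearrangement of the first.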

Regularization: Classical physics theory breaks down at small scales, e.g., the difference between an electron and a point particle shown above. Addressing this problem requires new kinds of additional physical constraints. For instance, in this case, assuming a finite electron radius (i.e., regularizing the electron mass-energy) suffices to explain the system below a certain size. Similar regularization arguments work in other renormalization problems. For example, a theory may hold under one narrow set of conditions, but due to calculations involving infinities or singularities, it may break down under other conditions or scales. In the case of the electron, another way to avoid infinite mass-energy while retaining the point nature of the particle is to postulate tiny additional dimensions over which the particle could 'spread out' rather than restrict its motion solely over 3D space. This is precisely the motivation behind string theory and other multi-dimensional models including multiple time dimensions. Rather than assuming the existence of unknown new physics, renormalization assumes the existence of particle interactions with other surrounding particles in the environment, and so offers an alternative strategy to resolve infinities in such classical problems.

Specific types

Specific types of regularization procedures include

Realistic regularization

Conceptual problem

Perturbative predictions by quantum field theory about quantum scattering of elementary particles, implied by a corresponding Lagrangian density, are computed using the Feynman rules, a regularization method to circumvent ultraviolet divergences so as to obtain finite results for Feynman diagrams containing loops, and a renormalization scheme. The regularization method results in regularized n-point Green's functions (propagators), and a suitable limiting procedure (a renormalization scheme) then leads to perturbative S-matrix elements. These are independent of the particular regularization method used, and enable one to model perturbatively the measurable physical processes (cross sections, probability amplitudes, decay widths and lifetimes of excited states). However, so far no known regularized n-point Green's functions can be regarded as being based on a physically realistic theory of quantum scattering, since the derivation of each disregards some of the basic tenets of conventional physics (e.g., by not being Lorentz-invariant, by introducing either unphysical particles with a negative metric or wrong statistics, or discrete space-time, or lowering the dimensionality of space-time, or some combination thereof). So the available regularization methods are understood as formalistic technical devices, devoid of any direct physical meaning. In addition, there are qualms about renormalization. For a history and comments on this more than half-a-century old open conceptual problem, see e.g.

Pauli's conjecture

As it seems that the vertices of non-regularized Feynman series adequately describe interactions in quantum scattering, it is taken that their ultraviolet divergences are due to the asymptotic, high-energy behavior of the Feynman propagators. So it is a prudent, conservative approach to retain the vertices in Feynman series, and modify only the Feynman propagators to create a regularized Feynman series. This is the reasoning behind the formal Pauli–Villars covariant regularization by modification of Feynman propagators through auxiliary unphysical particles, cf. and representation of physical reality by Feynman diagrams.
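As an illustration of how modifying only the propagator can tame the high-energy behavior, consider a sketch with a single auxiliary particle of large mass $M$. This is a simplified textbook form, assumed here for illustration: Euclidean signature is used, and the symbols $k$, $m$ and $M$ are not taken from the text above.

$$\frac{1}{k^{2} + m^{2}} - \frac{1}{k^{2} + M^{2}} = \frac{M^{2} - m^{2}}{(k^{2} + m^{2})(k^{2} + M^{2})} \sim \frac{M^{2} - m^{2}}{k^{4}} \quad (k^{2} \gg M^{2}).$$

The subtracted propagator falls off as $1 / k^{4}$ rather than $1 / k^{2}$ at large momentum, which improves the convergence of loop integrals, and the original propagator is recovered in the limit $M \to \infty$. The relative minus sign is what makes the auxiliary particle unphysical.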

In 1949 Pauli conjectured there is a realistic regularization, which is implied by a theory that respects all the established principles of contemporary physics. So its propagators (i) do not need to be regularized, and (ii) can be regarded as such a regularization of the propagators used in quantum field theories that might reflect the underlying physics. The additional parameters of such a theory do not need to be removed (i.e. the theory needs no renormalization) and may provide some new information about the physics of quantum scattering, though they may turn out experimentally to be negligible. By contrast, any present regularization method introduces formal coefficients that must eventually be disposed of by renormalization.

Opinions

Paul Dirac was persistently, extremely critical about procedures of renormalization. In 1963, he wrote, "… in the renormalization theory we have a theory that has defied all the attempts of the mathematician to make it sound. I am inclined to suspect that the renormalization theory is something that will not survive in the future,…" He further observed that "One can distinguish between two main procedures for a theoretical physicist. One of them is to work from the experimental basis ... The other procedure is to work from the mathematical basis. One examines and criticizes the existing theory. One tries to pin-point the faults in it and then tries to remove them. The difficulty here is to remove the faults without destroying the very great successes of the existing theory."

Abdus Salam remarked in 1972, "Field-theoretic infinities first encountered in Lorentz's computation of electron have persisted in classical electrodynamics for seventy and in quantum electrodynamics for some thirty-five years. These long years of frustration have left in the subject a curious affection for the infinities and a passionate belief that they are an inevitable part of nature; so much so that even the suggestion of a hope that they may after all be circumvented - and finite values for the renormalization constants computed - is considered irrational."

However, in Gerard ’t Hooft’s opinion, "History tells us that if we hit upon some obstacle, even if it looks like a pure formality or just a technical complication, it should be carefully scrutinized. Nature might be telling us something, and we should find out what it is."

The difficulty with a realistic regularization is that none has been found so far, although its bottom-up approach would destroy nothing of the existing theory; nor is there any experimental basis for it.

Minimal realistic regularization

Considering distinct theoretical problems, Dirac in 1963 suggested: "I believe separate ideas will be needed to solve these distinct problems and that they will be solved one at a time through successive stages in the future evolution of physics. At this point I find myself in disagreement with most physicists. They are inclined to think one master idea will be discovered that will solve all these problems together. I think it is asking too much to hope that anyone will be able to solve all these problems together. One should separate them one from another as much as possible and try to tackle them separately. And I believe the future development of physics will consist of solving them one at a time, and that after any one of them has been solved there will still be a great mystery about how to attack further ones."

According to Dirac, "Quantum electrodynamics is the domain of physics that we know most about, and presumably it will have to be put in order before we can hope to make any fundamental progress with other field theories, although these will continue to develop on the experimental basis."

Dirac’s two preceding remarks suggest that the search for a realistic regularization should begin with quantum electrodynamics (QED) in four-dimensional Minkowski spacetime, starting from the original QED Lagrangian density.

The path-integral formulation provides the most direct way from the Lagrangian density to the corresponding Feynman series in its Lorentz-invariant form. The free-field part of the Lagrangian density determines the Feynman propagators, whereas the rest determines the vertices. As the QED vertices are considered to adequately describe interactions in QED scattering, it makes sense to modify only the free-field part of the Lagrangian density, so as to obtain a regularized Feynman series for which the Lehmann–Symanzik–Zimmermann reduction formula provides a perturbative S-matrix that: (i) is Lorentz-invariant and unitary; (ii) involves only the QED particles; (iii) depends solely on QED parameters and those introduced by the modification of the Feynman propagators, and equals the QED perturbative S-matrix for particular values of these parameters; and (iv) exhibits the same symmetries as the QED perturbative S-matrix. Let us refer to such a regularization as the minimal realistic regularization, and start searching for the corresponding modified free-field parts of the QED Lagrangian density.
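As a purely illustrative sketch of requirement (iii), and not a realistic regularization in the sense above, consider a Pauli–Villars-type modification of a scalar propagator. The scalar form and the regulator mass $\Lambda$ are assumptions introduced here for illustration only:

$$
\frac{1}{k^2 - m^2 + i\epsilon} \;\longrightarrow\; \frac{1}{k^2 - m^2 + i\epsilon} - \frac{1}{k^2 - \Lambda^2 + i\epsilon} = \frac{m^2 - \Lambda^2}{(k^2 - m^2 + i\epsilon)(k^2 - \Lambda^2 + i\epsilon)}
$$

The modified propagator falls off as $1/k^4$ rather than $1/k^2$ at large $k^2$, which improves the convergence of loop integrals, and it reduces to the original propagator as $\Lambda \to \infty$. The subtracted term enters with the wrong sign, however, and corresponds to an additional unphysical particle of mass $\Lambda$, so this example is in conflict with requirements (i) and (ii). It shows only what a parameter-dependent modification of the propagator looks like.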

Transport theoretic approach

According to Bjorken and Drell, it would make physical sense to sidestep ultraviolet divergences by using a more detailed description than can be provided by differential field equations. Feynman noted about the use of differential equations: "... for neutron diffusion it is only an approximation that is good when the distance over which we are looking is large compared with the mean free path. If we looked more closely, we would see individual neutrons running around." He then wondered, "Could it be that the real world consists of little X-ons which can be seen only at very tiny distances? And that in our measurements we are always observing on such a large scale that we can’t see these little X-ons, and that is why we get the differential equations? ... Are they [therefore] also correct only as a smoothed-out imitation of a really much more complicated microscopic world?"

As early as 1938, Heisenberg proposed that a quantum field theory can provide only an idealized, large-scale description of quantum dynamics, valid for distances larger than some fundamental length, which Bjorken and Drell also expected in 1965. Feynman's preceding remark offers a possible physical reason for the existence of such a length; alternatively, it may simply restate the same idea (that there is a fundamental unit of distance) without adding new information.
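Feynman's neutron-diffusion remark can be illustrated with a toy model. The following is a minimal sketch, assuming a one-dimensional random walk in which every step has exactly the length of the mean free path; the function names and parameters are hypothetical and chosen only for this illustration. On large scales the spread of the walkers matches the diffusion-equation prediction, while on the scale of a single step the positions remain discrete, which the smooth solution of the differential equation cannot represent.

```python
import random
import statistics


def random_walk_positions(n_walkers, n_steps, mean_free_path, seed=0):
    """Final positions of independent 1-D walkers.

    Each walker takes n_steps steps of length mean_free_path,
    left or right with equal probability (a toy transport model).
    """
    rng = random.Random(seed)
    positions = []
    for _ in range(n_walkers):
        k = 0  # displacement counted in units of the mean free path
        for _ in range(n_steps):
            k += 1 if rng.random() < 0.5 else -1
        positions.append(k * mean_free_path)
    return positions


def diffusion_variance(n_steps, mean_free_path):
    """Variance predicted by the 1-D diffusion equation.

    With D = l**2 / (2 * tau) and t = n_steps * tau, the variance
    2 * D * t equals n_steps * l**2, independent of tau.
    """
    return n_steps * mean_free_path ** 2


def main():
    n_steps, step = 400, 1.0
    pos = random_walk_positions(2000, n_steps, step, seed=1)

    # Large scale: the sample variance is close to the diffusion prediction.
    print("sample variance   :", statistics.pvariance(pos))
    print("diffusion estimate:", diffusion_variance(n_steps, step))

    # Small scale: positions sit on a lattice of spacing 2 * step,
    # a granularity that the continuous diffusion equation smooths away.
    lattice_sites = sorted({round(x / step) for x in pos})
    print("first occupied sites:", lattice_sites[:5])


if __name__ == "__main__":
    main()
```

The agreement in variance is statistical, so it holds only up to sampling error; the lattice structure, by contrast, is exact in this model and disappears only when the observation scale is much larger than the step length.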

Hints at new physics

The need for regularization terms in any quantum field theory of quantum gravity is a major motivation for physics beyond the Standard Model. Infinities of the non-gravitational forces in QFT can be controlled by renormalization alone, but additional regularization, and hence new physics, is required uniquely for gravity. The regularizers model, and work around, the breakdown of QFT at small scales, and thus show clearly the need for some other theory to come into play beyond QFT at these scales. A. Zee (Quantum Field Theory in a Nutshell, 2003) considers this to be a benefit of the regularization framework: theories can work well in their intended domains but also contain information about their own limitations and point clearly to where new physics is needed.

