A dimensionality reduction method for high-dimensional circular data is developed, which is based on a principal component analysis (PCA) of data points on a torus. Adopting a geometrical view of PCA, various distance measures on a torus are introduced and the associated problem of projecting data onto the principal subspaces is discussed. The main idea is that the (periodicity-induced) projection error can be minimized by transforming the data such that the maximal gap of the sampling is shifted to the periodic boundary. In a second step, the covariance matrix and its eigendecomposition can be computed in a standard manner. Using molecular dynamics simulations of two well-established biomolecular systems (Aib9 and villin headpiece), the potential of the method to analyze the dynamics of backbone dihedral angles is demonstrated. The new approach allows for a robust and well-defined construction of metastable states and provides low-dimensional reaction coordinates that accurately describe the free energy landscape. Moreover, it offers a direct interpretation of covariances and principal components in terms of the angular variables. Apart from its application to PCA, the method of maximal gap shifting is general and can be applied to any other dimensionality reduction method for circular data.
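The two steps described above (shifting the maximal sampling gap to the periodic boundary, then a standard PCA) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `max_gap_shift` and `torus_pca` are hypothetical, the angles are assumed to be given in radians as an array of shape (n_samples, n_angles), each angle is treated independently, and placing the new boundary at the center of the largest gap is an assumption made here for concreteness.

```python
import numpy as np

def max_gap_shift(angles):
    """Shift each angular coordinate so that its largest sampling gap
    lies at the periodic boundary (hypothetical helper, for illustration).

    angles: array of shape (n_samples, n_angles), in radians.
    Returns (shifted, cuts): shifted angles in [-pi, pi) and, for each
    coordinate, the original angle that became the new boundary.
    """
    angles = np.asarray(angles, dtype=float)
    two_pi = 2.0 * np.pi
    wrapped = np.mod(angles, two_pi)                 # map to [0, 2*pi)
    srt = np.sort(wrapped, axis=0)
    # Gaps between sorted neighbours, plus the wrap-around gap
    # from the last point back to the first one.
    gaps = np.diff(srt, axis=0, append=srt[:1] + two_pi)
    idx = np.argmax(gaps, axis=0)
    cols = np.arange(wrapped.shape[1])
    # Assumption: cut the circle at the centre of the largest gap.
    cuts = srt[idx, cols] + 0.5 * gaps[idx, cols]
    shifted = np.mod(wrapped - cuts, two_pi) - np.pi  # cut now sits at +-pi
    return shifted, np.mod(cuts, two_pi)

def torus_pca(angles):
    """Standard PCA on the gap-shifted angles (illustrative sketch)."""
    shifted, cuts = max_gap_shift(angles)
    centered = shifted - shifted.mean(axis=0)
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return centered @ eigvecs, eigvals, eigvecs
```

For data clustered around the original boundary at ±π, the shift removes the artificial split of the cluster, so the variance entering the covariance matrix reflects the true angular spread rather than the jump across the boundary.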
REFERENCES
The projection of the data points onto this main principal component axis destroys any "neighborhood" properties: two points that are very close to each other on the torus (the data space) may end up arbitrarily far apart when projected (according to closest distance) onto this axis. Furthermore, on this principal component the data points will in general be distributed over an infinite length.
Due to a lack of rigor, in particular with respect to notation, it is not really clear from the article how exactly GeoPCA is performed. Some formulae of Ref. 18 refer to scalar products of D-dimensional vectors of data points with (D + 1)-dimensional vectors in the embedding space, which is clearly not what the authors had in mind. The existing computer program for this analysis (Ref. 53) seems to be based on the usual representation of a sphere by generalizations of Euler angles. However, the restriction to principal dimensions being great circles (geodesics) will yield satisfactory results only in very special cases of data structures.