Tuesday, April 24, 2018

PCA, SOM, and GTM

Excerpt from "Chemical Data Visualization and Analysis with Incremental Generative Topographic Mapping: Big Data Challenge", https://pubs.acs.org/doi/full/10.1021/ci500575y?src=recsys
 
However, both PCA and SOM have clear drawbacks. PCA, as a linear method of dimensionality reduction, may handle nonlinear data poorly. In some cases, a small number of principal components explains only a small part of the data variance. As noted by Bengio et al.(16), “the expressive power of linear features is very limited: they cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation”. This drastically hampers the ability of PCA to reveal the disentangled factors responsible for data variation, especially in the case of Big Data. Another problem is the low information richness of PCA plots, which tend to concentrate most of the data points in one region in the form of a Gaussian cloud while leaving the rest of the plot sparsely populated.(17) This behavior can be explained by the probabilistic interpretation of PCA, which casts it as a factor analysis based on a single multivariate normal distribution function.(18)
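To make the point about explained variance concrete, here is a minimal sketch (not from the paper) using only NumPy. The data set is a hypothetical one: a single angle t is embedded nonlinearly in 10 dimensions through the first five sine and cosine harmonics, so the data has one intrinsic degree of freedom, yet no pair of linear components captures much of its variance. The helper name `explained_variance_ratio` is my own.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                      # center the data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2                                 # proportional to PC variances
    return var / var.sum()

# Hypothetical data: one intrinsic parameter t on an evenly spaced grid,
# embedded nonlinearly in 10 dimensions via harmonics 1..5.
n = 1000
t = 2 * np.pi * np.arange(n) / n
X = np.column_stack([f(k * t) for k in range(1, 6) for f in (np.cos, np.sin)])

ratio = explained_variance_ratio(X)
top2 = ratio[:2].sum()
print(f"variance explained by the first two PCs: {top2:.2f}")
```

Because the ten features are mutually uncorrelated with equal variance, every principal component carries the same share, and a two-dimensional PCA plot keeps only about a fifth of the variance of what is really a one-dimensional curve.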
 
SOM is a nonlinear dimensionality reduction method. Due to its topology-preserving character, SOM provides more information-rich plots than PCA. However, SOM suffers from its purely empirical nature and lacks solid statistical foundations.(19) As a result, the output information is limited to the assignment of a molecule to its residence node and an indication of how well it fits into this node. SOM tools, by default, do not report whether other nodes might have hosted a molecule as well, at only slightly higher quantization errors (the quantization error being the mean dissimilarity between each molecule and the code vector of its residence neuron). Since SOM does not define any probability distribution function, the powerful tools of statistical analysis and inference cannot be applied. The training algorithm for SOM does not optimize an objective function(20) and, therefore, does not guarantee convergence. The choice of SOM parameters (learning rate and width of the neighborhood function) is essentially empirical, without any statistical justification.
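The "residence node" limitation is easy to see in a few lines. The sketch below is my own illustration, not code from the paper or from any SOM package: it takes a hypothetical, already trained 2x2 codebook and a hypothetical molecule descriptor, finds the best-matching unit, and then looks at the runner-up that a hard assignment would silently discard.

```python
import numpy as np

def node_distances(x, codebook):
    """Euclidean distance from one sample to every SOM code vector."""
    return np.linalg.norm(codebook - x, axis=1)

# Hypothetical trained codebook: 4 nodes in a 2-D descriptor space.
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

# Hypothetical molecule lying almost halfway between nodes 0 and 1.
x = np.array([0.48, 0.10])

d = node_distances(x, codebook)
order = np.argsort(d)                 # nodes sorted from closest to farthest
bmu, runner_up = order[0], order[1]   # residence node and the next candidate

print("residence node:", bmu, "distance:", round(float(d[bmu]), 3))
print("runner-up node:", runner_up, "distance:", round(float(d[runner_up]), 3))
```

A standard SOM report keeps only the residence node and its distance. The near tie with the runner-up is lost unless you compute it yourself as above, and even then the distances are not probabilities.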
 
These shortcomings prompted Bishop et al.(21) to propose generative topographic mapping (GTM) as a probabilistic extension of SOM. GTM overcomes most of the limitations of SOM without introducing new disadvantages. GTM is a probabilistic topology-preserving dimensionality reduction method,(21) which projects the D-dimensional chemical space onto a two-dimensional space. It has been shown that GTM can be used not only as a chemical data visualization tool(17, 22, 23) but also to build classification(17, 22) and regression(24) structure–property models.
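What "probabilistic" buys you can be sketched with the responsibility step of GTM. In the standard formulation, each latent grid node is mapped to the center of an isotropic Gaussian in data space with a shared inverse variance beta, and every data point receives a posterior probability (responsibility) for each node. The code below is a rough sketch under those assumptions: it skips training entirely, and the centers, the value of beta, and the function name `gtm_responsibilities` are hypothetical choices of mine.

```python
import numpy as np

def gtm_responsibilities(T, Y, beta):
    """Posterior probability of each node (rows of Y) for each sample (rows of T).

    Assumes equal prior weight per node and isotropic Gaussians with
    inverse variance beta, as in the standard GTM formulation.
    """
    sq = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    log_p = -0.5 * beta * sq                                 # log-likelihood up to a constant
    log_p -= log_p.max(axis=1, keepdims=True)                # stabilize the exponentials
    R = np.exp(log_p)
    return R / R.sum(axis=1, keepdims=True)                  # each row sums to 1

# Hypothetical values: 4 Gaussian centers in data space and their 2-D latent grid.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latent_grid = np.array([[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
T = np.array([[0.48, 0.10]])      # one hypothetical molecule descriptor
beta = 10.0                       # assumed inverse variance

R = gtm_responsibilities(T, Y, beta)
mean_position = R @ latent_grid   # responsibility-weighted position on the 2-D map

print("responsibilities:", np.round(R[0], 3))
print("mean latent position:", np.round(mean_position[0], 3))
```

Unlike the hard SOM assignment, the near tie between the two closest nodes shows up directly in the responsibilities, and the molecule lands between them on the map rather than on a single node.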