Science Notes

Bootstrap:docker
From:ubuntu:latest
%post
    apt update
    apt -y install openssh-server
    apt -y install openssh-client
    apt -y install build-essential
    apt -y install less
    apt -y install nano
    TZ="America/New_York"
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
    apt -y install tzdata
    apt -y install locales
    locale-gen --purge en_US.UTF-8
    # install R
    apt -y install --no-install-recommends software-properties-common dirmngr
    wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
    add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
    apt -y install --no-install-recommends r-base
%environment
    export LC_ALL=C.UTF-8
    export LANG=C.UTF-8

Let us illustrate PCA in R programming language. As an example, we will use a built-in dataset mtcars (samples in the rows and variables in the columns).

> head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Scale data before PCA.

> X <- scale(mtcars)

Call function prcomp to compute PCA.

> result1 <- prcomp(X, retx = TRUE, center = FALSE, scale. = FALSE)

Object result1 is a list with several components:

`sdev`	the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix).
`rotation`	the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
`x`	the PCA scores or the rotated data (data multiplied by the `rotation` matrix), which is the projection of the data on the eigenvectors.
`center, scale`	the centering and scaling used, or `FALSE`.

Next, we compute SVD of data X.

> result2 <- svd(X)

Object result2 is a list with several components:

`d`	a vector containing the singular values of `X`.
`u`	a matrix whose columns contain the left singular vectors of `x`.
`v`	a matrix whose columns contain the right singular vectors of `x`.

Comparing the components of prcomp output result1 with components of svd output result2, it is clear that

(1) eigenvector matrix result1$rotation is equivalent to result2$v

> result1$rotation[, 1:3] #show first 3 columns due to page size limit PC1 PC2 PC3 mpg -0.3625305 0.01612440 -0.22574419 cyl 0.3739160 0.04374371 -0.17531118 disp 0.3681852 -0.04932413 -0.06148414 hp 0.3300569 0.24878402 0.14001476 drat -0.2941514 0.27469408 0.16118879 wt 0.3461033 -0.14303825 0.34181851 qsec -0.2004563 -0.46337482 0.40316904 vs -0.3065113 -0.23164699 0.42881517 am -0.2349429 0.42941765 -0.20576657 gear -0.2069162 0.46234863 0.28977993 carb 0.2140177 0.41357106 0.52854459 > s2$v[, 1:3] [,1] [,2] [,3] [1,] -0.3625305 0.01612440 -0.22574419 [2,] 0.3739160 0.04374371 -0.17531118 [3,] 0.3681852 -0.04932413 -0.06148414 [4,] 0.3300569 0.24878402 0.14001476 [5,] -0.2941514 0.27469408 0.16118879 [6,] 0.3461033 -0.14303825 0.34181851 [7,] -0.2004563 -0.46337482 0.40316904 [8,] -0.3065113 -0.23164699 0.42881517 [9,] -0.2349429 0.42941765 -0.20576657 [10,] -0.2069162 0.46234863 0.28977993 [11,] 0.2140177 0.41357106 0.52854459

> sapply(1:ncol(result1$rotation), function(i) all.equal(as.vector(result1$rotation[, i]), result2$v[, i])) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

(2) PCA scores matrix result1$x can be derived from the scaled data X by multiplying by right singular vectors result2$v

> scores <- X %*% result2$v

> head(result1$x[, 1:3]) #show first 3 columns due to page size limit
PC1 PC2 PC3 Mazda RX4 -0.64686274 1.7081142 -0.5917309 Mazda RX4 Wag -0.61948315 1.5256219 -0.3763013 Datsun 710 -2.73562427 -0.1441501 -0.2374391 Hornet 4 Drive -0.30686063 -2.3258038 -0.1336213 Hornet Sportabout 1.94339268 -0.7425211 -1.1165366 Valiant -0.05525342 -2.7421229 0.1612456

> head(scores[, 1:3]) [,1] [,2] [,3] Mazda RX4 -0.64686274 1.7081142 -0.5917309 Mazda RX4 Wag -0.61948315 1.5256219 -0.3763013 Datsun 710 -2.73562427 -0.1441501 -0.2374391 Hornet 4 Drive -0.30686063 -2.3258038 -0.1336213 Hornet Sportabout 1.94339268 -0.7425211 -1.1165366 Valiant -0.05525342 -2.7421229 0.1612456

> sapply(1:ncol(result1$x), function(i) all.equal(result1$x[, i], scores[, i])) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

(3) PCA scores matrix result1$x is also equivalent to left singular vectors result2$u multiplied by singular values result2$d

> scores2 <- result2$u %*% diag(result2$d)

> head(result1$x[, 1:3]) PC1 PC2 PC3 Mazda RX4 -0.64686274 1.7081142 -0.5917309 Mazda RX4 Wag -0.61948315 1.5256219 -0.3763013 Datsun 710 -2.73562427 -0.1441501 -0.2374391 Hornet 4 Drive -0.30686063 -2.3258038 -0.1336213 Hornet Sportabout 1.94339268 -0.7425211 -1.1165366 Valiant -0.05525342 -2.7421229 0.1612456

> head(scores2[, 1:3]) [,1] [,2] [,3] [1,] -0.64686274 1.7081142 -0.5917309 [2,] -0.61948315 1.5256219 -0.3763013 [3,] -2.73562427 -0.1441501 -0.2374391 [4,] -0.30686063 -2.3258038 -0.1336213 [5,] 1.94339268 -0.7425211 -1.1165366 [6,] -0.05525342 -2.7421229 0.1612456

> sapply(1:ncol(result1$x), function(i) all.equal(as.vector(result1$x[, i]), scores2[, i])) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

(4) the standard deviations of principal components are equivalent to singular values divided by the square root of n - 1, where n is the number of samples.

> all.equal(result1$sdev, result2$d / sqrt(nrow(X) - 1)) [1] TRUE

Science Notes

Friday, January 27, 2023

A Singularity definition file with R installation

Friday, January 13, 2023

wsl logon failure: the user has not been granted the requested logon type at this computer.

Thursday, October 6, 2022

Illustration of PCA using SVD with a Real Example