Effect of varying number of principle components before performing non-linear dimensional reduction


Qingyu Chen
I'm a First-year BME Ph.D. student in Yi Lab :-)

Effect of varying number of principle components before performing non-linear dimensional reduction

Plot Description

The visualization shows the effect of varying number of principle components (2,4,6,10,20,30) on later non-linearity reduced space.

Expression of CD1c gene is color-coded to see how well the corresponding tSNE clusters relavent data.

From scree plot, the best PC cut-off would be 6-10.

If I perform non-linear dimensionality reduction on PCs, what happens when I vary how many PCs should I use?

When performing non-linear dimensionality reduction on principal components (PCs), the choice of how many PCs to use can significantly impact the resulting reduced space.

  1. Few PCs (e.g., 2 or 4): Using a small number of PCs may lead to underfitting. The reduced space might not capture the full complexity of the data, resulting in poor representation. Clusters in the reduced space are poorly separated.

  2. Intermediate PCs (e.g., 6 or 10): This range is considered optimal. It strikes a balance between capturing relevant information and avoiding overfitting. The reduced space should retain most of the data’s structure while minimizing noise.CD1c clusters can be seen on these plots.

  3. Many PCs (e.g., 20 or 30): Including more PCs can lead to overfitting. The reduced space may become too specific to trade-off across all genes instead of important ones, making it too tightly packed, potentially obscuring meaningful patterns.

References

[1] https://r-lang.com/how-to-create-empty-list-in-r/#:~:text=How%20to%20Create%20an%20Empty%20List%20in%20R,list%20…%204%20Create%20an%20empty%20vector%20

[2] https://gganimate.com/reference/animate.html

[3] https://stackoverflow.com/questions/37397303/change-label-of-gganimate-frame-title


data <- read.csv('~/Desktop/genomic-data-visualization-2024/data/pikachu.csv.gz', 
                 row.names = 1)
pos <- data[, 4:5]
gexp <- data[6:ncol(data)]
rownames(pos) <- rownames(gexp) <- data$cell_id
head(pos)
head(gexp)

## maybe you want to normalize?
## maybe you want to get kmeans clusters?
## save.image('test.RData') 
## load('test.RData')

pcs <- prcomp(gexp)
emb <- Rtsne::Rtsne(pcs$x[,1:10])$Y
geneofinterest <- log10(gexp$KRT6B+1)
  
## first make a static visualization 
library(ggplot2)
df1 <- data.frame(pcs$x[,1:2], gene=geneofinterest)
colnames(df1) <- c('PC_x', 'PC_y', 'gene')
ggplot(df1) + geom_point(aes(x = PC_x, PC_y, col=gene))

df2 <- data.frame(emb, gene=geneofinterest)
colnames(df2) <- c('PC_x', 'PC_y', 'gene')
ggplot(df2) + geom_point(aes(x = PC_x, PC_y, col=gene))

df3 <- data.frame(pos, gene=geneofinterest)
colnames(df3) <- c('PC_x', 'PC_y', 'gene')
ggplot(df3) + geom_point(aes(x = PC_x, PC_y, col=gene, size=0.1), size=0.1)

library(gganimate)
df <- rbind(cbind(df1, order=1, size=1), 
            cbind(df2, order=2, size=1),
            cbind(df3, order=3, size=0.01))
head(df)
p <- ggplot(df) + geom_point(aes(x = PC_x, y = PC_y, col=gene, size=size), size=0.1)
p

anim <- p + transition_states(order) + view_follow()
animate(anim, height = 300, width = 300)

## if you are having trouble rendering, try
## install.packages('gifski')

df1 <- data.frame(pos, gene=log10(gexp$CD9+1))
colnames(df1) <- c('x', 'y', 'gene')
ggplot(df1) + geom_point(aes(x = x, y, col=gene), size=0.1) +
  scale_color_viridis_c()

df2 <- data.frame(pos, gene=log10(gexp$CD3E+1))
colnames(df2) <- c('x', 'y', 'gene')
ggplot(df2) + geom_point(aes(x = x, y, col=gene), size=0.1) +
  scale_color_viridis_c()

df <- rbind(cbind(df1, label='CD9'), 
            cbind(df2, label='CD3E'))
head(df)
p <- ggplot(df) + geom_point(aes(x = x, y = y, col=gene), size=0.1) +
  scale_color_viridis_c()
p

anim <- p + transition_states(label) + view_follow() +
  labs(title = '{closest_state}') 
animate(anim, height = 300, width = 300)