EC1- tSNE on genes vs on PCs

Hi I am Amanda and this is my dog muimui from Hong Kong!

Describe your figure briefly so we know what you are depicting (you no longer need to use precise data visualization terms as you have been doing).

The data was normalized by the count and then log transformed. This is an animation of 3 plots.

I first performed tSNE on gene expressions. The first plot encodes the expression of PTPRC as a gradient of color hue from gray to red in the tSNE space when tSNE was performed on genes. We see several distinct clusters, and some clusters from the bottom half of the tSNE plot seem to upregulate PTPRC. The choice of PTPRC would be explained later.

I plotted a scree plot of the standard deviations over the number of PCs and observed an elbow at about the 8th PC. This indicates that most variance is being captured by the first 8 PCs.Thus, I performed tSNE on the first 8 PCs. The second plot encodes the expression of PTPRC as a gradient of color hue from gray to red in the tSNE space when tSNE was performed on the first 8 PCs. We see that the clusters seem less distinct, which makes sense because information on some genes are lost. But still, we see PTPRC being upregulated in clusters from the bottom half of the tSNE plot.

From the scree plot, we see that the standard deviation is still pretty high for the 3rd PC. But for the sake of exploration, I performed tSNE on the first 2 PCs. The third plot encodes the expression of PTPRC as a gradient of color hue from gray to red in the tSNE space when tSNE was performed only on the first 2 PCs. Here, we lost the pattern of upregulation of PTPRC in specific clusters.

PTPRC was specifically chosen because it is the gene with the highest loading (highest contribution) in the 3rd PC, and so I expect the expression of PTPRC in tSNE space to be different when tSNE is performed only on the first 2 PCs. Variation in PC3 is highly influenced by PTPRC expression. If we include PC3, where PTPRC contributes the most, the tSNE structure will reflect PTPRC’s expression pattern more strongly. If we exclude PC3 (only use the first two PCs), the tSNE representation will be missing the key variation from PTPRC, causing a different cluster structure or spatial distribution.

From this exploration, we see that we retain the most information when we perform non-linear dimensionality reduction on genes directly. However, to save time from running through the large number of genes, we might want to perform linear dimensionality reduction on the genes first before performing non-linear dimensionality reduction on those data (the PCs). In this case, it is important that we use the number of PCs that capture most of the variance of the genes such that we don’t lose important information.

Code (paste your code in between the ``` symbols)

file <- "data/pikachu.csv.gz"
install.packages('gganimate')
install.packages('gifski')

data <- read.csv(file)
data[1:5,1:10]
dim(data)

pos <- data[, 5:6]
rownames(pos) <- data$cell_id
gexp <- data[, 7:ncol(data)]
rownames(gexp) <- data$barcode


## normalize by total expression
norm <- gexp/rowSums(gexp) * 10000
norm[1:5,1:5]
loggexp <- log10(norm + 1) 

## normalization by log transform
loggexp <- log10(norm + 0.000001) 


## try many ks for genes
ks = seq.int(1, 20, 2)
ec1totw <- sapply(ks, function(k) {
  print(k)
  set.seed(1)
  com <- kmeans(loggexp, centers=k)
  return(com$tot.withinss)
})

## find optimal k from elbow plot
library(ggplot2)
ec1df0<-data.frame(ks,ec1totw)
gene_totw<-ggplot(ec1df0, aes(x=ks, y=ec1totw)) + geom_point(size=3)+
  labs(
    title = "Total Withinness for different k's",
    x = "Number of k",
    y = "Total Withinness"
  ) +
  theme_bw()
gene_totw

## k-means clustering
set.seed(1)
genes_com <- kmeans(loggexp, centers=15)
genes_clusters <- genes_com$cluster
genes_clusters <- as.factor(genes_clusters) ## tell R it's a categorical variable
names(genes_clusters) <- rownames(gexp)
head(genes_clusters)

## tSNE
set.seed(1)
genes_emb <- Rtsne::Rtsne(loggexp)
head(genes_emb$Y)

#PTPRC in tSNE space
genes_tSNE <- data.frame(genes_emb$Y, clusters=genes_clusters, gene = loggexp[, 'PTPRC'])
genes_tSNE_plot<- ggplot(genes_tSNE, aes(x=X1, y=X2, colour=gene)) + geom_point(size=1)+
  labs(
    title = "tSNE on genes",
    x = "tSNE1",
    y = "tSNE2"
  ) +
  scale_color_gradient(low = 'lightgrey', high='red') +
  theme_bw()
genes_tSNE_plot





## perform pca
set.seed(1)
pcs<- prcomp(loggexp)
names(pcs)
pcs$sdev
pcs$rotation[1:5,1:5]
?prcomp

## visualize a scree plot
screedf<- data.frame(sdev=pcs$sdev,index=1:length(pcs$sdev))
ggplot(screedf, aes(x=index, y=sdev))+geom_point()
ggplot(screedf[1:25,], aes(x=index, y=sdev))+geom_point()
##    most variance is being captured by first 8 PCs


## tSNE on first 8 PCs
elbow <- sapply(2:20, function(k) {
  out <- kmeans(pcs$x[,1:8], centers=k)
  out$tot.withinss
})
plot(2:20, elbow)
# 15 is a reasonable elbow for clustering on PCA data too

## k-means clustering
pc_com <- kmeans(pcs$x[,1:8], centers=15)
pc_cluster <- pc_com$cluster
pc_cluster <- as.factor(pc_cluster) ## tell R it's a categorical variable
names(pc_cluster) <- rownames(pcs)
head(pc_cluster)

## tSNE
pc_emb <- Rtsne::Rtsne(pcs$x[,1:8])
head(pc_emb$Y)

#PTPRC in tSNE space
pc_tSNE <- data.frame(pc_emb$Y, clusters=pc_cluster, gene = loggexp[, 'PTPRC'])
pc_tSNE_plot<- ggplot(pc_tSNE, aes(x=X1, y=X2, color=gene)) + geom_point(size=1)+
  labs(
    title = "tSNE on first 8 PCs",
    x = "tSNE1",
    y = "tSNE2"
  ) +
  scale_color_gradient(low = 'lightgrey', high='red') +
  theme_bw()
pc_tSNE_plot




## tSNE on first 2 PCs
elbow <- sapply(2:20, function(k) {
  out <- kmeans(pcs$x[,1:2], centers=k)
  out$tot.withinss
})
plot(2:20, elbow)
# 15 is a reasonable elbow for clustering on PCA data too

pc3_loadings <- pcs$rotation[, 3]
sort(pc3_loadings, decreasing=FALSE)
#PTPRC has the highest loading in PC3, which is not captured when performing tSNE on PC[,1:2]



## k-means clustering
pc_com <- kmeans(pcs$x[,1:2], centers=15)
pc_cluster <- pc_com$cluster
pc_cluster <- as.factor(pc_cluster) ## tell R it's a categorical variable
names(pc_cluster) <- rownames(pcs)
head(pc_cluster)

## tSNE on first 2 PCs
pc2_emb <- Rtsne::Rtsne(pcs$x[,1:2])
head(pc2_emb$Y)

#clusters in tSNE space
pc2_tSNE <- data.frame(pc2_emb$Y, clusters=pc_cluster, gene = loggexp[, 'PTPRC'])
pc2_tSNE_plot<- ggplot(pc2_tSNE, aes(x=X1, y=X2, color=gene)) + geom_point(size=1)+
  labs(
    title = "tSNE on first 2 PCs",
    x = "tSNE1",
    y = "tSNE2"
  ) +
  scale_color_gradient(low = 'lightgrey', high='red') +
  theme_bw()
pc2_tSNE_plot





## in order to animate, need to make new data frame with all the information
# Create the animation data frame with order labels
anim_df <- rbind(
  cbind(genes_tSNE, order = "tSNE on Genes"),
  cbind(pc_tSNE, order = "tSNE on first 8 PCs"),
  cbind(pc2_tSNE, order = "tSNE on first 2 PCs")
)

library(gganimate)
# Create the base plot
p <- ggplot(anim_df, aes(x = X1, y = X2, color = gene)) +
  geom_point(size = 1, alpha = 0.7) + 
  scale_color_gradient(low = 'lightgrey', high='red') +
  labs(x = "tSNE1", y = "tSNE2") +
  theme_bw() +
  theme(legend.position = "right") +
  transition_states(order, transition_length = 3, state_length = 3) +
  ggtitle("{closest_state}") +  # Dynamic title based on 'order'
  ease_aes('linear') +
  view_follow()


anim <- p + transition_states(order) + view_follow() + ease_aes('linear')
animate(anim, height=300, width=300) 
anim_save("hwEC1.gif", animation = last_animation(), path = "C:/Users/looia/OneDrive - Johns Hopkins/2025 Spring/Genomic Data Visualizations/ec1")
?animate

Code for volcano plot referenced https://jef.works/genomic-data-visualization-2024/blog/2024/02/14/challin1/

The final png was merged on https://products.groupdocs.app/merger/png#folderName=dd1be75e-700e-4c83-b0c3-8787377d0c58 because the the viewport on R is too small for my figure.

28 Feb 2025

hwEC1

« Difference between linear or nonlinear dimensionality reduction Visualizing the Impact of the Number of PCs used to perform Nonlinear Dimensionality Reduction using tSNE »