Animation of Non-Linear Dimensionality Reduction (tSNE) on Varied number of PCs

I’m pursuing the combined BS/MSE in Biomedical Engineering, and in my free time I love playing basketball, trying different types of tea, and unwinding with podcasts.

Animation of Non-Linear Dimensionality Reduction (tSNE) on Varied number of PCs

4. If I perform non-linear dimensionality reduction on PCs, what happens when I vary how many PCs I use? Write a brief description of your figure so we know what you are visualizing.

This animation explores how the number of principal components (PCs) affects the structure of a t-SNE embedding of gene expression data from the Xenium spatial transcriptomics kidney dataset. PC counts of 2, 5, 10, 15, 20, 30, and 50 were chosen to span a range from well below to well above the scree plot elbow at ~10, capturing the full arc of how embedding quality changes. Each frame shows two panels: a scree plot on the left highlighting how many PCs are being used, and a t-SNE plot on the right showing how cells group in 2D space when that many PCs are fed into t-SNE. Cells are colored by k-means cluster (k=7, determined by elbow analysis) on a random subsample of 10,000 cells. At very few PCs (2-5), large white spaces appear between and within clusters, suggesting that cells labeled as the same cluster may not actually be transcriptionally similar, as the embedding is capturing too little variance to meaningfully group cells together. Around 10 PCs, distinct cell populations emerge as tight, well-separated islands with clear boundaries between them, consistent with where the scree plot elbow falls. Beyond 20-30 PCs, the embedding visibly degrades as clusters begin bleeding into each other and overlapping, with cells from different populations mixing together in the same regions of the plot. This cluster-on-cluster overlap at high PC counts shows that including too many PCs introduces noise that distorts the t-SNE’s ability to separate biologically distinct populations. A random subsample of 10,000 cells was used out of the full 85,880 cells in the dataset because t-SNE was run 7 times, once per PC count, and running it on the full dataset would have been computationally prohibitive. 10,000 cells was chosen as a reasonable tradeoff that still captures the diversity of cell populations while keeping runtime manageable. The number of clusters was set to k=7 based on elbow analysis performed in a prior analysis of this dataset. PC counts were capped at 50 since the scree plot shows standard deviation effectively plateauing well before that point, making higher PC counts unlikely to add meaningful biological signal.

##Code

library(ggplot2)
library(Rtsne)
library(gganimate)
library(magick)
library(dplyr)

#Load data 
data <- read.csv('~/Desktop/GDV/Xenium-IRI-ShamR_matrix.csv')
pos  <- data[, c('x', 'y')]
rownames(pos) <- data[, 1]
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[, 1]

#Normalize & log-transform
totgexp <- rowSums(gexp)
totgexp[totgexp == 0] <- 1
mat <- log10(gexp / totgexp * 1e6 + 1)

#PCA
set.seed(42)
pcs <- prcomp(mat, center = TRUE, scale = FALSE)

#Scree data (first 50 PCs)
sdev_df <- data.frame(
  PC   = 1:50,
  sdev = pcs$sdev[1:50]
)

#PC counts to animate over
pc_steps <- c(2, 5, 10, 15, 20, 30, 50)

# AI prompt: I want to animate a scree plot over pc_steps <- c(2,5,10,15,20,30,50).
# Can you build a long-format dataframe that repeats the scree data for each step
# and includes a column indicating whether each PC index is included in that step?
df_sdev <- do.call(rbind, lapply(pc_steps, function(npcs) {
  data.frame(sdev_df, npcs = npcs, included = sdev_df$PC <= npcs)
}))
df_sdev$npcs_f <- factor(df_sdev$npcs, levels = pc_steps)

#Subsample to 10,000 cells (85,880 total: full run would take hours)
set.seed(42)
N       <- nrow(pcs$x)
sub_idx <- if (N > 10000) sample(N, 10000) else seq_len(N)
pcs_sub <- pcs$x[sub_idx, ]

#Run tSNE + kmeans for each PC count
#AI prompt: For each value in pc_steps, I want to run tSNE on the first npcs
#columns of pcs$x and also run kmeans with k=7 on the same PC subset. Can you
#build a long-format dataframe with tSNE coordinates, cluster labels, and npcs
#as a column so I can animate over it with gganimate?
set.seed(42)
df_emb <- do.call(rbind, lapply(pc_steps, function(npcs) {
  message("Running tSNE with ", npcs, " PCs ...")
  ts <- Rtsne::Rtsne(pcs_sub[, 1:npcs, drop = FALSE],
                     dims             = 2,
                     perplexity       = 30,
                     pca              = FALSE,
                     verbose          = FALSE,
                     check_duplicates = FALSE)
  km <- kmeans(pcs_sub[, 1:npcs, drop = FALSE], centers = 7, nstart = 50, iter.max = 50)
  data.frame(
    tSNE1   = ts$Y[, 1],
    tSNE2   = ts$Y[, 2],
    cluster = as.factor(km$cluster),
    npcs    = npcs
  )
}))
df_emb$npcs_f <- factor(df_emb$npcs, levels = pc_steps)

#Shared theme
base_theme <- theme_bw(base_size = 13) +
  theme(
    legend.position  = "bottom",
    plot.title       = element_text(face = "bold", size = 13),
    panel.grid.minor = element_blank()
  )

#Left panel: Clean scree plot
#AI prompt: How do I make a ggplot scree plot that animates through each value
#in pc_steps, showing a grey baseline for all 50 PCs and a blue highlighted
#line and points only for the included PCs, plus a vertical dashed line at
#the cutoff that moves each frame?
p_scree <- ggplot(df_sdev, aes(x = PC)) +
  geom_line(aes(y = sdev), color = "grey70", linewidth = 0.5) +
  geom_point(aes(y = sdev), color = "grey70", size = 1.2) +
  geom_line(
    data = df_sdev %>% filter(included),
    aes(y = sdev),
    color = "#3366CC", linewidth = 1.3
  ) +
  geom_point(
    data = df_sdev %>% filter(included),
    aes(y = sdev),
    color = "#3366CC", size = 2.5
  ) +
  geom_vline(
    data = df_sdev %>% distinct(npcs),
    aes(xintercept = npcs),
    color = "#3366CC", linetype = "dashed", linewidth = 0.8
  ) +
  labs(
    title = "Scree Plot — {closest_state} PCs used",
    x = "PC Index",
    y = "Standard Deviation"
  ) +
  base_theme +
  theme(legend.position = "none") +
  transition_states(npcs, transition_length = 2, state_length = 5) +
  ease_aes("cubic-in-out")

#Right panel: tSNE colored by kmeans cluster (k=7)
cluster_pal <- c("#E41A1C","#377EB8","#4DAF4A","#984EA3",
                 "#FF7F00","#A65628","#F781BF")

p_tsne <- ggplot(df_emb, aes(x = tSNE1, y = tSNE2, color = cluster)) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values = cluster_pal, name = "Cluster") +
  labs(
    title = "tSNE Embedding — {closest_state} PCs",
    x = "tSNE 1",
    y = "tSNE 2"
  ) +
  base_theme +
  theme(legend.position = "right") +
  guides(color = guide_legend(ncol = 1,
                              override.aes = list(size = 4, alpha = 1))) +
  transition_states(npcs, transition_length = 2, state_length = 5) +
  enter_fade() + exit_fade() +
  ease_aes("cubic-in-out")

#Render & stitch side-by-side 
#AI prompt: I have two gganimate animations rendered with magick_renderer().
#How do I combine them into a single GIF with the scree plot on the left and
#the tSNE on the right, stitching corresponding frames together side by side?
nframes <- 140
fps     <- 10

gif_scree <- animate(p_scree, fps = fps, nframes = nframes,
                     width = 480, height = 480, res = 110,
                     renderer = magick_renderer())

gif_tsne  <- animate(p_tsne, fps = fps, nframes = nframes,
                     width = 620, height = 480, res = 110,
                     renderer = magick_renderer())

combined <- image_append(c(gif_scree[1], gif_tsne[1]))
for (i in 2:nframes) {
  combined <- c(combined, image_append(c(gif_scree[i], gif_tsne[i])))
}

# Display and save
combined
image_write(combined, path = "~/Desktop/GDV/sakshi_hw_ec1.gif")

# References:
# gganimate documentation: https://gganimate.com/
# Rtsne package: https://github.com/jkrijthe/Rtsne
# magick: https://docs.ropensci.org/magick/
# k=7 clusters determined by elbow method in prior HW3 analysis
# AI assistance used and prompts mentioned above

01 Mar 2026

HW EC1

« Deconvolution on Visium Dataset EC2-Integrated Spatial Analysis of Proximal Tubule Cell Types Using K-means Clustering and STdeconvolve (K = 5) Across Visium and Xenium »