A multipanel data visualization of PC1 gene loadings against gene features

Hi! I am Saadia Jameel, a first year BME Ph.D. student

A multipanel data visualization of PC1 gene loadings against gene features

1. What data types are you visualizing?

Categorical- PCA unscaled and PCA scaled Quantitative data- PC1 gene loading Gene expression variability (variance) Mean gene expression

2. What data encodings (geometric primitives and visual channels) are you using to visualize these data types?

Geometric primitives- Points (representing binned summaries of genes) Lines (encoding gene variance in the bar chart) Visual channels- Colour hue (to differentiate between the scaled and unscaled PCA computation) Position along x axis encodes gene features, binned SD or binned mean expression in panels A and B Position along x axis encodes gene identity in panel C Position along y axis encodes gene loadings on PC1 in panels A and B Position along y axis encodes encodes gene variance across spots in panel C

3. What about the data are you trying to make salient through this data visualization?

Genes with higher variability tend to contribute more to PC1. In unscaled PCA, genes with higher variance disproportionately dominate PC1. In scaled PCA, variance differences are suppressed, and thus the curve is flatter but still increasing. Highly expressed genes (greater mean), have better signal-to-noise ratio, and therefore contribute more to PC1. Scaling reduces but does not eliminate this trend.

The identity of PC1-driving genes depends on scaling. This indicates that unscaled PCA is dominated by high variance genes whereas scaled PCA captures coordinated expression patterns independent of magnitude.

4. What Gestalt principles or knowledge about perceptiveness of visual encodings are you using to accomplish this?

Color hue is used to distinguish between PCA runs with and without scaling, as humans perceive differences in hue effectively for categorical data.

The Gestalt principle of continuity is used to highlight the relationship between gene loadings on the first principal component and gene features such as mean expression and variance. The connected line plots encourage the viewer to perceive smooth trends rather than isolated points.

In the bar chart, length is used to highlight quantitative differences in gene variance, as humans are effective at comparing magnitudes when length is used as the visual encoding.

Additionally, the Gestalt principle of proximity is used in the bar chart to group genes by PCA condition (scaled versus unscaled), helping the viewer quickly distinguish between the two sets of PC1 driving genes.

5. Code (paste your code in between the ``` symbols)

# Spatial transcriptomics PCA:
#   - Row 1: SD-binned trend (UNSCALED vs SCALED) + Mean-binned trend (UNSCALED vs SCALED)
#   - Row 2: Top PC1 driver genes bar chart (UNSCALED and SCALED)

install.packages(c("ggplot2","dplyr","patchwork","ggrepel"))
library(ggplot2)
library(dplyr)
library(patchwork)
library(ggrepel)

# reset theme- I had to do this as I was getting errors on and off
theme_set(theme_gray())

# consistent use of colour heus across figures
pca_colors <- c(
  "UNSCALED" = "#00BFC4",  # teal
  "SCALED"   = "#F8766D"   # coral
)

# Read data
data <- read.csv('~/Documents/genomic-data-visualization-2026/data/Visium-IRI-ShamR_matrix.csv.gz')

# get familiar with the data
pos <- data[, c('x','y')]
rownames(pos) <- data[,1]
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[,1]
gexp <- as.matrix(gexp)
storage.mode(gexp) <- "numeric"

# log transformed counts per million normalize
totgexp <- rowSums(gexp)
mat <- log10(sweep(gexp, 1, totgexp, "/") * 1e6 + 1)  # spots x genes

# compute gene features across spots
gene_mean <- colMeans(mat)
gene_sd   <- apply(mat, 2, sd)

# Keep only genes with nonzero SD and finite values
keep <- is.finite(gene_sd) & gene_sd > 0
mat <- mat[, keep, drop = FALSE]
gene_mean <- gene_mean[keep]
gene_sd   <- gene_sd[keep]

# PCA- Unscaled
pca_ns <- prcomp(mat, center = TRUE, scale. = FALSE)
pc1_load <- pca_ns$rotation[, 1]

# PCA- Scaled genes: z-score each gene across spots
mat_scaled <- scale(mat, center = TRUE, scale = TRUE)
pca_s <- prcomp(mat_scaled, center = FALSE, scale. = FALSE)
pc1_load_s <- pca_s$rotation[, 1]

# Build gene-level data frames (aligned by gene names) 
df <- data.frame(
  gene = names(pc1_load),
  pc1_loading = as.numeric(pc1_load),
  mean = gene_mean[names(pc1_load)],
  sd   = gene_sd[names(pc1_load)]
)

df_s <- data.frame(
  gene = names(pc1_load_s),
  pc1_loading = as.numeric(pc1_load_s),
  mean = gene_mean[names(pc1_load_s)],
  sd   = gene_sd[names(pc1_load_s)]
)

# Remove directionality: use |loading| for contribution magnitude
df$abs_loading   <- abs(df$pc1_loading)
df_s$abs_loading <- abs(df_s$pc1_loading)


# bin gened based on standard deviation and mean for easy of visualization
n_bins <- 20

# unscaled
df_var_binned <- df %>%
  mutate(var_bin = ntile(sd, n_bins)) %>%
  group_by(var_bin) %>%
  summarise(
    mean_sd = mean(sd),
    median_abs_loading = median(abs_loading),
    .groups = "drop"
  )

# scaled
df_s_var_binned <- df_s %>%
  mutate(var_bin = ntile(sd, n_bins)) %>%
  group_by(var_bin) %>%
  summarise(
    mean_sd = mean(sd),
    median_abs_loading = median(abs_loading),
    .groups = "drop"
  )

# unscaled
df_mean_binned <- df %>%
  mutate(mean_bin = ntile(mean, n_bins)) %>%
  group_by(mean_bin) %>%
  summarise(
    mean_expr = mean(mean),
    median_abs_loading = median(abs_loading),
    .groups = "drop"
  )

# scaled
df_s_mean_binned <- df_s %>%
  mutate(mean_bin = ntile(mean, n_bins)) %>%
  group_by(mean_bin) %>%
  summarise(
    mean_expr = mean(mean),
    median_abs_loading = median(abs_loading),
    .groups = "drop"
  )

# Combine scaled and unscaled for overlay plots
df_var_binned$PCA   <- "UNSCALED"
df_s_var_binned$PCA <- "SCALED"
df_var_combined <- bind_rows(df_var_binned, df_s_var_binned) %>%
  mutate(PCA = factor(trimws(PCA), levels = c("UNSCALED","SCALED")))

df_mean_binned$PCA   <- "UNSCALED"
df_s_mean_binned$PCA <- "SCALED"
df_mean_combined <- bind_rows(df_mean_binned, df_s_mean_binned) %>%
  mutate(PCA = factor(trimws(PCA), levels = c("UNSCALED","SCALED")))

# Plot A
p_var <- ggplot(df_var_combined,
                aes(x = mean_sd, y = median_abs_loading, color = PCA)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_color_manual(values = pca_colors, drop = FALSE) +
  labs(
    title = "Gene expression variability (SD) vs PC1 gene contribution",
    x = "Gene expression variability (SD, binned)",
    y = "PC1 gene contribution (median |loading|)",
    color = "PCA"
  ) +
  theme_classic() +
  theme(
    plot.title   = element_text(size = 11, , face="bold"),
    axis.title.x = element_text(),
    axis.title.y = element_text()
  )

# Plot B
p_mean <- ggplot(df_mean_combined,
                 aes(x = mean_expr, y = median_abs_loading, color = PCA)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_color_manual(values = pca_colors, drop = FALSE) +
  labs(
    title = "Mean gene expression vs PC1 gene contribution",
    x = "Mean gene expression (log-CPM, binned)",
    y = "PC1 gene contribution (median |loading|)",
    color = "PCA"
  ) +
  theme_classic() +
  theme(
    plot.title   = element_text(size = 11, face="bold"),
    axis.title.x = element_text(),
    axis.title.y = element_text()
  )

# use a bar chart to show variance and identity of the top PC1-driver genes (scaled vs unscaled)
top_n <- 20

top_unscaled <- df %>%
  arrange(desc(abs_loading)) %>%
  slice_head(n = top_n) %>%
  mutate(PCA = "UNSCALED")

top_scaled <- df_s %>%
  arrange(desc(abs_loading)) %>%
  slice_head(n = top_n) %>%
  mutate(PCA = "SCALED")

top_bars <- bind_rows(top_unscaled, top_scaled) %>%
  select(gene, PCA, pc1_loading, abs_loading, sd) %>%
  mutate(
    PCA = factor(trimws(PCA), levels = c("UNSCALED","SCALED")),
    variance = sd^2
  )

# Spacer levels create a true gap between groups
spacers <- c("SPACER1", "SPACER2")

x_levels <- c(
  paste0("UNSCALED__", top_unscaled$gene),
  spacers,
  paste0("SCALED__",   top_scaled$gene)
)

spacer_df <- data.frame(
  gene = NA,
  PCA = factor("SPACER", levels = c("UNSCALED","SCALED","SPACER")),
  pc1_loading = NA,
  abs_loading = NA,
  sd = NA,
  variance = NA,
  x_id = spacers
)

top_bars <- top_bars %>%
  mutate(x_id = paste0(as.character(PCA), "__", gene)) %>%
  bind_rows(spacer_df) %>%
  mutate(
    x_id = factor(x_id, levels = x_levels),
    PCA  = factor(as.character(PCA), levels = c("UNSCALED","SCALED","SPACER"))
  )

# plot: SAME X, but Y = variance
p_topN_var <- ggplot(top_bars, aes(x = x_id, y = variance, fill = PCA)) +
  geom_col(width = 0.8, na.rm = TRUE) +
  scale_x_discrete(labels = function(z)
    ifelse(grepl("SPACER", z), "", sub(".*__", "", z))
  ) +
  scale_fill_manual(
    values = c(pca_colors, "SPACER" = NA),
    breaks = c("UNSCALED","SCALED"),
    drop = FALSE
  ) +
  labs(
    title = paste0("Variance of top PC1 driver genes"),
    x = "Genes",
    y = "Gene variance across spots",
    fill = "PCA"
  ) +
  theme_classic() +
  theme(
    plot.title   = element_text(size = 11, hjust = 0.5, face="bold"),
    axis.text.x  = element_text(angle = 45, hjust = 1),
    axis.title.x = element_text(),
    axis.title.y = element_text()
  )

# multi panel visualization
((p_var | p_mean) / (p_topN_var)) +
  plot_annotation(tag_levels = "A")
 

6. Resources

I used R documentation and the ? help function on R itself to understand functions.

I used ChatGpt for clean up tasks, in my plots. The following are a few prompts used. Increase the space between the two categories to show greater proximity. Make the colour heu used consistent across plots.

03 Feb 2026

« HW2 HW2 Submission »