HW2
External resources: prompt used: "make setwd as output path, improve the layout structure."
1. What data types are you visualizing?
I visualize (1) quantitative gene-level data: the PC1 loading, mean expression, and variance of each gene; and (2) spatial cell-level data: the x, y coordinates of each cell together with its quantitative PC1 score and a derived PC1 "signature" score.
2. What data encodings (geometric primitives and visual channels) are you using to visualize these data types?
Panels A–B use points as the geometric primitive (one point per gene). The x-position encodes gene mean (Panel A) or gene variance (Panel B), and the y-position encodes PC1 loading. A diverging color scale centered at 0 redundantly encodes the sign and magnitude of the PC1 loading, which makes positive and negative loadings easy to separate at a glance.
Panels C–D use points (one point per cell) positioned by spatial coordinates (x,y). Color encodes either the PC1 score (Panel C) or a derived PC1 signature defined as mean(expression of top +loading genes) − mean(expression of top −loading genes) (Panel D). The same diverging color semantics across panels supports direct comparison.
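To make the Panel D definition concrete, here is a minimal sketch of the signature on a toy matrix. The matrix values, the loadings, and every name ending in `_toy` are made up for illustration and are not taken from the dataset.

```r
# Toy log-normalized matrix: 5 cells (rows) x 6 genes (columns); values are simulated.
set.seed(1)
lognorm_toy <- matrix(
  runif(30), nrow = 5,
  dimnames = list(paste0("cell", 1:5), paste0("gene", 1:6))
)

# Hypothetical PC1 loadings, one per gene.
loadings_toy <- c(gene1 = 0.6, gene2 = 0.4, gene3 = 0.1,
                  gene4 = -0.1, gene5 = -0.3, gene6 = -0.7)

k <- 2
top_pos <- names(sort(loadings_toy, decreasing = TRUE))[seq_len(k)]  # most positive loadings
top_neg <- names(sort(loadings_toy))[seq_len(k)]                     # most negative loadings

# Per-cell signature: mean of the top + genes minus mean of the top - genes.
signature_toy <- rowMeans(lognorm_toy[, top_pos, drop = FALSE]) -
  rowMeans(lognorm_toy[, top_neg, drop = FALSE])
```

Because the signature averages only a handful of genes, it can be read directly in expression units, which is a plausible reason to show it next to the full PC1 score.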
3. What about the data are you trying to make salient through this data visualization?
I aim to make salient whether the first principal component is primarily driven by gene-level properties (mean/variance) and whether that dominant axis corresponds to coherent spatial heterogeneity in the tissue. In particular, I want to show (i) how extreme positive vs negative PC1 loadings distribute across gene mean/variance, and (ii) whether cells with high vs low PC1 scores form structured spatial domains rather than appearing spatially random.
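Both questions can also be checked numerically alongside the figure. The sketch below is one possible check on simulated data, not part of the submitted analysis. It uses a Spearman correlation for (i) and a nearest-neighbor score correlation for (ii); the latter is a simple stand-in chosen here for illustration, not a standard spatial statistic. All `_toy` names and simulated values are assumptions.

```r
# Toy data only: simulated values, not the Xenium dataset.
set.seed(1)

# (i) Gene level: does the magnitude of the PC1 loading track gene mean or variance?
n_genes <- 200
gene_mean_toy <- rexp(n_genes)
gene_var_toy <- gene_mean_toy * runif(n_genes, 0.5, 1.5)
loading_toy <- rnorm(n_genes, sd = 0.05)
rho_mean <- cor(gene_mean_toy, abs(loading_toy), method = "spearman")
rho_var <- cor(gene_var_toy, abs(loading_toy), method = "spearman")

# (ii) Cell level: do neighboring cells have similar scores?
n_cells <- 100
xy <- cbind(x = runif(n_cells), y = runif(n_cells))
score_structured <- xy[, "x"] - 0.5 + rnorm(n_cells, sd = 0.05)  # left-to-right gradient
score_shuffled <- sample(score_structured)  # same values, spatial order destroyed

d <- as.matrix(dist(xy))
diag(d) <- Inf                      # a cell is not its own neighbor
nearest <- apply(d, 1, which.min)   # index of each cell's nearest neighbor
nn_cor <- function(score) cor(score, score[nearest])

nn_structured <- nn_cor(score_structured)
nn_shuffled <- nn_cor(score_shuffled)
```

A neighbor correlation well above that of the shuffled scores suggests spatially coherent domains. The full pairwise distance matrix is only practical for small inputs, so a real dataset would need a proper nearest-neighbor search.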
4. What Gestalt principles or knowledge about perceptiveness of visual encodings are you using to accomplish this?
I rely on position as a high-precision visual channel to reveal relationships between loading and mean/variance (Panels A–B). I use a diverging color map as a preattentive feature to segment genes/cells into positive vs negative PC1 contributions without requiring point-by-point reading. Across panels, I keep the encoding consistent (diverging color centered at 0) to support comparison and reduce cognitive load. In the spatial maps, proximity and similarity of color (Gestalt: proximity/similarity) help the viewer perceive coherent domains and boundaries, if present.
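As an illustration of what "diverging color centered at 0" means, here is a small base-R sketch that maps values to a blue/grey/red palette with symmetric limits, so that 0 always lands on the neutral midpoint. The values, the 101-step palette, and the helper `to_color` are hypothetical and are not the mapping ggplot2 uses internally.

```r
# Hypothetical PC1 scores (made up for illustration).
pc1_score_toy <- c(-3.2, -0.5, 0, 1.1, 2.4)

# Symmetric limit: the largest absolute value defines both ends of the scale.
lim <- max(abs(pc1_score_toy))

# 101-step diverging palette; index 51 is the neutral midpoint.
pal <- colorRampPalette(c("blue", "grey70", "red"))(101)

# Map a value in [-lim, lim] to a palette index in 1..101.
to_color <- function(v) pal[round((v + lim) / (2 * lim) * 100) + 1]

colors_toy <- to_color(pc1_score_toy)
```

In ggplot2, a comparable effect can be requested by passing explicit `limits = c(-lim, lim)` to `scale_color_gradient2()` alongside `midpoint = 0`.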
5. Code
library(readr)
library(dplyr)
library(ggplot2)
library(patchwork)
library(irlba)
library(matrixStats)
library(scales)
# ---- user settings ----
setwd("~/Desktop/JHU/26spring/Genomic data vis/hw1")
your_name <- "YOURNAME"
input_path <- file.path(getwd(), "Xenium-IRI-ShamR_matrix.csv")
out_dir <- file.path(getwd(), "homework", "hw2")
out_path <- file.path(out_dir, paste0(your_name, ".png"))
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
# ---- load data ----
dat <- read_csv(input_path, show_col_types = FALSE)
names(dat)[1] <- "cell_id"
stopifnot(all(c("x","y") %in% names(dat)))
coords <- dat %>% select(cell_id, x, y)
expr_df <- dat %>% select(-cell_id, -x, -y)
gene_names <- colnames(expr_df)
expr <- as.matrix(expr_df)
storage.mode(expr) <- "numeric"
# filter empty cells
libsize <- rowSums(expr)
keep <- libsize > 0
expr <- expr[keep, , drop = FALSE]
coords <- coords[keep, , drop = FALSE]
libsize <- libsize[keep]
# ---- log-normalization (counts per 10k, then log1p) ----
norm_counts <- sweep(expr, 1, libsize, "/") * 1e4
lognorm <- log1p(norm_counts)
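# ---- PCA ----
# Drop zero-variance genes: scaling to unit variance is undefined for constant columns.
keep_genes <- colVars(lognorm) > 0
lognorm <- lognorm[, keep_genes, drop = FALSE]
gene_names <- gene_names[keep_genes]
set.seed(1)
pca <- prcomp_irlba(lognorm, n = 30, center = TRUE, scale. = TRUE)
pc1_load <- pca$rotation[, 1] # gene loadings
pc1_score <- pca$x[, 1] # cell scores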
# ---- gene statistics ----
gene_mean <- colMeans(lognorm)
gene_var <- colVars(lognorm)
genes_tbl <- tibble(
  gene = gene_names,
  mean_logexpr = gene_mean,
  var_logexpr = gene_var,
  pc1_loading = pc1_load
)
# ---- top +/- loadings for a robust PC1 "signature" ----
k <- 10
top_pos <- genes_tbl %>% arrange(desc(pc1_loading)) %>% slice_head(n = k) %>% pull(gene)
top_neg <- genes_tbl %>% arrange(pc1_loading) %>% slice_head(n = k) %>% pull(gene)
pos_sig <- rowMeans(lognorm[, top_pos, drop = FALSE])
neg_sig <- rowMeans(lognorm[, top_neg, drop = FALSE])
pc1_signature <- pos_sig - neg_sig
plot_df <- coords %>% mutate(pc1_score = pc1_score, pc1_signature = pc1_signature)
# ---- Panels ----
p1 <- ggplot(genes_tbl, aes(x = mean_logexpr, y = pc1_loading, color = pc1_loading)) +
  geom_point(alpha = 0.7, size = 1.6) +
  scale_color_gradient2(low = muted("blue"), mid = "grey70", high = muted("red"), midpoint = 0) +
  labs(title = "PC1 gene loading vs mean expression",
       x = "Gene mean (log-normalized)", y = "PC1 loading") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
p2 <- ggplot(genes_tbl, aes(x = var_logexpr, y = pc1_loading, color = pc1_loading)) +
  geom_point(alpha = 0.7, size = 1.6) +
  scale_color_gradient2(low = muted("blue"), mid = "grey70", high = muted("red"), midpoint = 0) +
  labs(title = "PC1 gene loading vs variance",
       x = "Gene variance (log-normalized)", y = "PC1 loading") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
p3 <- ggplot(plot_df, aes(x = x, y = y, color = pc1_score)) +
  geom_point(size = 0.25, alpha = 0.8) +
  coord_fixed() +
  scale_color_gradient2(low = muted("blue"), mid = "grey70", high = muted("red"), midpoint = 0) +
  labs(title = "Spatial pattern of PC1 scores", x = "x", y = "y", color = "PC1 score") +
  theme_minimal(base_size = 12)
p4 <- ggplot(plot_df, aes(x = x, y = y, color = pc1_signature)) +
  geom_point(size = 0.25, alpha = 0.8) +
  coord_fixed() +
  scale_color_gradient2(low = muted("blue"), mid = "grey70", high = muted("red"), midpoint = 0) +
  labs(title = paste0("Spatial PC1 signature: mean(top +", k, ") − mean(top −", k, ") genes"),
       subtitle = paste0("Top+ genes: ", paste(top_pos, collapse = ", "),
                         "\nTop− genes: ", paste(top_neg, collapse = ", ")),
       x = "x", y = "y", color = "Signature") +
  theme_minimal(base_size = 12) +
  theme(plot.subtitle = element_text(size = 9))
# Tag panels A-D so the figure matches the panel letters used in the write-up.
fig <- (p1 | p2) / (p3 | p4) +
  plot_annotation(
    title = "HW2: Dimensionality reduction (PCA): loadings vs gene stats & spatial structure",
    subtitle = "Panels A–B: gene-level relationships; Panels C–D: cell-level spatial patterns",
    tag_levels = "A"
  )
ggsave(out_path, fig, width = 14, height = 10, dpi = 300)
message("Saved figure to: ", out_path)