HW 4
Write a description to convince me you found the same cell-type.
Cluster 2 most likely represents proximal tubule S3 epithelial cells. This is supported by its inner stripe/outer medulla location highlighted in yellow in Panel C/F and Mep1a’s high expression in the same area in PCA and tSNE plots (Panels A/D and Panels B/E), matching a brush-border proximal tubule pattern. Additionally, the heatmap (Panel G) shows Mep1a rising alongside other transport/energy genes like Aqp1 and Acsm3, indicating a coordinated proximal tubule program. This aligns with a study by Herzog et al., which shows that meprin A (Mep1a) is a brush-border enzyme in proximal tubules and changes with acute kidney injury, reinforcing its S3 proximal tubule identity. https://pubmed.ncbi.nlm.nih.gov/17377510/
Write a description of what you changed and why you think you had to change it.
Using the Visium dataset, I normalized and log-transformed the data and performed PCA, in which the scree plot showed 5 points before the “elbow” of the plot. Previously, in the Xenium dataset, the scree plot included 9 points before reaching the “elbow” of the plot, but 9 pcs on the Visium dataset would have been redundant. Additionally, I performed t-SNE on the top 10 PCs because I wanted to better visualize the clusters and any local patterns, which I had not done previously.
I previously performed k-means clustering with k=8 on the Xenium dataset. With the Visium dataset, I performed k-means clustering with k=5. This is because I found the optimal K based on total-withinness to be only 5 for this dataset. One possible reason there is a lower optimal number of transcriptionally distinct cell clusters in the spot-based dataset compared to the single-cell resolution dataset is that in Visium, each spot mixes signals from multiple cells, potentially showing smoother patterns. Therefore, the Visium data could form fewer, broader groups, whereas finer clustering is needed for single-cell data.
Additionally, I changed the potential marker for Cluster 2 from Slc7a13 to Mep1a because, in the Xenium dataset, Slc7a13 often bled outside its cluster and wasn’t exclusive, making it less reliable as a defining marker. When observing Mep1a, it showed a stronger, more localized gene expression in Cluster 2 with the high z-score on the heatmap and aligned with the cluster’s PCA/t-SNE plots, indicating greater specificity.
Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
# Read in data
data <- read.csv("Visium-IRI-ShamR_matrix.csv.gz")
data[1:5, 1:5]
pos <- data[, c('x', 'y')]
rownames(pos) <- data[,1]
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[,1]
# Normalize and log transform data
totgexp <- rowSums(gexp)
head(totgexp)
head(sort(totgexp, decreasing=TRUE))
mat <- log10(gexp / totgexp * 1e6 + 1 )
dim(mat) # 1224 19465
# PCA
## no scaling
pcs <- prcomp(mat, center=TRUE, scale=FALSE)
names(pcs)
head(pcs$sdev)
length(pcs$sdev)
plot(pcs$sdev[1:50]) # scree plot reveals 5 points before the elbow
pc_scores <- pcs$x[, 1:4, drop=FALSE]
colnames(pc_scores) <- paste0("PC", 1:4)
# conduct tSNE
library(Rtsne)
pcs_use <- pcs$x[, 1:10, drop = FALSE]
tsne_out <- Rtsne(pcs_use, perplexity = 30, theta = 0.5, verbose = TRUE, check_duplicates = FALSE)
# determine k-centroids through scree plot
wcss <- wss <- sapply(2:12, function(k) {
kmeans(pcs_use, centers = k, nstart = 50, iter.max = 100)$tot.withinss
})
k_centroids <- 1:50
scree_df <- data.frame(
k = k_centroids,
wcss = wcss # makes no sense
)
scree_plot <- ggplot(scree_df, aes(x = k, y = wcss)) +
geom_line() +
geom_point(size = 2) +
labs(
title = "Scree plot for choosing k-centroids",
x = "Number of clusters",
y = "Total within-cluster Sum of Squares"
) +
theme_classic() +
theme(plot.title = element_text(size = 11))
# choose top 5 PCs
pcs_use <- pcs$x[, 1:5, drop=FALSE]
# run k-means
set.seed(123)
km <- kmeans(pcs_use, centers = 7, nstart = 100, iter.max = 300)
clusters <- factor(km$cluster)
df <- data.frame(
pcs$x[, 1:2, drop = FALSE],
clusters
)
# compute the centroid
centroid_pcs <- aggregate(cbind(PC1, PC2) ~ clusters, data = df, FUN = median)
# panel B
cell_cluster_pc <- ggplot(df, aes(PC1, PC2, color = clusters)) +
geom_point(alpha = 0.6, size = 2) +
labs(title = "All Clusters in PCA", x = "PC1", y = "PC2", color = "Cluster") +
theme_classic()
# panel c
# tsne coordinates
embedding_coords <- tsne_out$Y
colnames(embedding_coords) <- c("tSNE1", "tSNE2")
df <- data.frame(
tSNE1 = embedding_coords[,1],
tSNE2 = embedding_coords[,2],
clusters = factor(clusters)
)
# compute centroids for tsne
centroid_tsne <- aggregate(cbind(tSNE1, tSNE2) ~ clusters, data = df, FUN = median)
p_tsne <- ggplot(df, aes(tSNE1, tSNE2, color = clusters)) +
geom_point(alpha = 0.5) +
labs(title = "All Clusters in t-SNE space") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
# panel d
# Ensure clusters is a factor and aligned with pos
clusters_f <- factor(clusters[rownames(pos)])
# Data frame for plotting
df_sp <- data.frame(x = pos$x, y = pos$y, cluster = clusters_f)
# Spatial plot colored by cluster
ggplot(df_sp, aes(x, y, color = cluster)) +
geom_point(size = 2, alpha = 0.8) +
coord_equal() +
labs(title = "Spatial expression by cluster", x = "x", y = "y", color = "Cluster") +
theme_classic()
# Determined through the spatial expression that the cluster of interest is cluster 2
# as it has a very similiar spatial expression as the cluster of interest in HW3
# set up df for differential expression
library(matrixStats)
# trying to characterize cluster 2:
# Used AI: Prompt: help me identify genes upregulated in cluster 2
# vs all other clusters using a Wilcoxon rank-sum test on log-normalized data
# identify upregulated genes in cluster 2 vs other clusters
# Align to common cells to avoid mismatches
common_cells <- intersect(rownames(mat), names(clusters))
mat_use <- mat[common_cells, , drop = FALSE]
clusters_use <- clusters[common_cells]
genes <- colnames(mat_use)
# Define cluster 2 vs rest
cl_cells <- names(clusters_use)[clusters_use == 2]
ot_cells <- names(clusters_use)[clusters_use != 2]
# Extract matrices
mat_in <- mat_use[cl_cells, genes, drop = FALSE]
mat_out <- mat_use[ot_cells, genes, drop = FALSE]
# Wilcoxon rank-sum p-values (upregulated in cluster 2)
wilcox_p <- sapply(genes, function(g) {
wilcox.test(mat_in[, g], mat_out[, g], alternative = "greater")$p.value
})
wilcox_fdr <- p.adjust(wilcox_p, method = "BH")
# expression threshold
expr_threshold = log10(2)
# Effect sizes
mean_diff <- colMeans(mat_in) - colMeans(mat_out) # magnitude on log scale
pct_in <- colMeans(mat_in > expr_threshold) # detection in cluster 7
pct_out <- colMeans(mat_out > expr_threshold) # detection outside
pct_diff <- pct_in - pct_out
auroc <- sapply(genes, function(g) auroc_gene(mat_in[, g], mat_out[, g])) # separation
# Assemble results table
de_2 <- data.frame(
gene = genes,
wilcox_p = wilcox_p,
wilcox_fdr = wilcox_fdr,
mean_diff = mean_diff,
pct_in = pct_in,
pct_out = pct_out,
pct_diff = pct_diff,
stringsAsFactors = FALSE
)
# Used AI to add in auroc
# Prompt: How could I compute AUROC per gene?
# add in auroc
auroc_gene <- function(x_in, x_out) {
n_in <- length(x_in); n_out <- length(x_out)
r <- rank(c(x_in, x_out)) # average ranks for ties
U <- sum(r[1:n_in]) - n_in * (n_in + 1) / 2
U / (n_in * n_out) # AUC in [0,1]
}
de_2$auroc <- sapply(genes, function(g) auroc_gene(mat_in[, g], mat_out[, g]))
# Keep upregulated, significant genes and rank by strength (AUROC, then pct_diff, then mean_diff)
markers_2 <- subset(de_2, wilcox_fdr < 0.05 & mean_diff > 0)
markers_2 <- markers_2[order(-markers_2$auroc, -markers_2$pct_diff, -markers_2$mean_diff), ]
head(markers_2, 20)
# plot heatmap of cluster-averaged expression for top markers
# select top N upregulated markers for cluster 2
topN <- 30
top_genes <- head(markers_2$gene, topN)
top_genes <- top_genes[top_genes %in% colnames(mat)] # ensure present
# compute average expression per cluster for these genes
cl_levels <- sort(unique(clusters))
avg_by_cluster <- sapply(cl_levels, function(c) {
cells_c <- names(clusters)[clusters == c]
colMeans(mat[cells_c, top_genes, drop = FALSE])
})
# transform so that clusters are rows and genes are columns
avg_by_cluster <- t(avg_by_cluster)
colnames(avg_by_cluster) <- top_genes
rownames(avg_by_cluster) <- paste0("Cluster_", cl_levels)
# z-score genes across clusters
# purpose: make genes comparable versus using raw values for gene expression
z <- t(scale(t(avg_by_cluster)))
# Define color scale: blue (neg) → white (0) → red (pos)
# used AI to revise the look of the heat map
# prompt: could you revise this code so that the cluster labels are on the
# left side of the y-axis and that the gene labels are read from left to right.
# additionally, add a label for the legend "z-score" and set white = 0 with positive
# z score at red and negative at blue. additionally, remove clustering tree
library(pheatmap)
heatmap_plot <- pheatmap(
z,
cluster_rows = FALSE, # remove dendrogram (tree) for clusters
cluster_cols = FALSE, # remove dendrogram for genes
color = colorRampPalette(c("blue", "white", "red"))(100),
breaks = seq(-2, 2, length.out = 101), # ensures white centered at 0
main = "",
fontsize_row = 9,
fontsize_col = 9,
border_color = NA
)
# add legend title "z-score"
library(grid)
grid::grid.text("z-score", x = unit(0.97, "npc"), y = unit(0.9, "npc"), rot = 90, gp = gpar(cex = 0.9))
# add custom title
grid.text(
"Top Upregulated Markers from Cluster 2 across All Clusters",
x = 0.5,
y = 0.98,
gp = gpar(fontsize = 12, fontface = "plain")
)
ggsave("heatmap_plot.png", heatmap_plot, width = 6, height = 5, dpi = 300)
# Double check that Mep1a is differentially upregulated
# Define cluster 2 vs rest
cl2 <- names(clusters)[clusters == 2]
other <- names(clusters)[clusters != 2]
# Extract expression for Mep1a (log10 CPM+1)
x1 <- mat[cl2, "Mep1a"]
x2 <- mat[other, "Mep1a"]
# Wilcoxon rank-sum test (upregulated in cluster 2)
wilcox_p <- wilcox.test(x1, x2, alternative = "greater")$p.value
wilcox_fdr <- p.adjust(wilcox_p, method = "BH")
# Effect sizes
logFC <- mean(x1) - mean(x2)
thr <- log10(2)
pct_in <- mean(x1 > thr)
pct_out <- mean(x2 > thr)
# Print a brief summary
cat("Mep1a: p =", wilcox_p, "FDR =", wilcox_fdr,
"logFC =", round(logFC, 3),
"pct_in =", round(pct_in, 2),
"pct_out =", round(pct_out, 2), "\n")
# visualize Mep1a in reduced dimensional space (PCA)
Mep1a_df <- data.frame(PC1 = pcs$x[,1], PC2 = pcs$x[,2], Mep1a = mat[, "Mep1a"])
# panel E
# Mep1a
rng <- range(Mep1a_df$Mep1a, na.rm = TRUE)
ggplot(Mep1a_df, aes(PC1, PC2, color = Mep1a)) +
geom_point(size = 2, alpha = 0.8) +
labs(title = "Mep1a expression in PCA space (PC1 vs PC2)", x = "PC1", y = "PC2") +
scale_color_viridis_c(option = "C", limits = rng, name = "log10(expr+1)") +
theme_classic()
# panel f
ggplot(Mep1a_spatial_df, aes(x, y, color = Mep1a)) +
geom_point(size = 2, alpha = 0.8) +
scale_color_viridis_c(option = "C", limits = rng, name = "log10(Expression+1)") +
labs(title = "Spatial expression of Mep1a", x = "x", y = "y") +
coord_equal() + theme_classic()
# panel g
colnames(embedding_coords) <- c("tSNE1", "tSNE2")
df_tsne <- data.frame(tSNE1 = embedding_coords[,1], tSNE2 = embedding_coords[,2], Mep1a = mat[, "Mep1a"])
ggplot(df_tsne, aes(tSNE1, tSNE2, color = Mep1a)) +
geom_point(size = 2, alpha = 0.8) +
scale_color_viridis_c(option = "C", name = "log10(expr+1)") +
labs(title = "Mep1a expression in t-SNE space", x = "t-SNE 1", y = "t-SNE 2") +
theme_classic()