HW5


Siddhi Date
Hi I'm Siddhi

HW5

Description:

For this assignment, I first started out by following similar steps to my previous homeworks - normalizing the data, performing kmeans clustering by using the optimal k value, visualizing my results in pca space, and performing differential expression analysis (via the wilcox test) for upregulation and downregulation. I visualized the cluster data in spatial position as well as sanity checks. However, I noticed when visualizing in PCA space, that the kmeans clustering seemed quite scattered, and it seemed a little scattered when plotted in spatial locations as well. Due to this, I reduced the dimensionality of the data by applying PCA first, and then I performed kmeans clustering on the reduced data. I followed the other steps in the same order as previous after this step. I found through this method that the visualizations of clustering in space were more distinct. Thus I decided that this methodology, reducing dimensionality and then performing clustering, would be more suitable for the needs of this assignment, especially given how little we know about the dataset.
Figures 1 and 2 show the clustering in PC space and xy space respectively. Figures 3 and 4 display differential expression levels in the logarithmic scale for clusters 6 and 3 respectively. Figure 5 shows the upregulated CD21 expression in cluster 6, which very distinctly matches the region I want to capture. This visually indicates a strong relation between CD21 and my cell cluster of interest. Figure 6 shows PanCK upregulation expression in areas that also have downregulated CD31. Through iterations, I discovered that plotting both of these was necessary to get a visual representation specific to my cluster of interest. This shows an interesting characteristic of the data, specifically that in cluster 3, BOTH the CD31 downregulation and PanCK upregulation are needed to accurately characterize the cell. I am confident in my results, as I followed a consistent process and performed careful checks along the way. I performed dimensionality reduction on the data, then kmeans clustering, making sure to plot the total withinness for different k values before settling on the optimal “elbow” value fo k=8. Then I performed the wilcox tests for both upregulation and downregulation in my clusters of interest. Identifying the most statistically significant upregulation and downregulation allow me to make a stronger correlation to cell type, rather than just focusing on only up or down regulation. I also plotted my upregulation and downregulation spatially as a “sanity check” and made sure they match my original spatial clustering. I decided to investigate clusters 6 and 3. I picked these from my spatial visualization of clustering (Fig2.), as upon visual inspection they appear to be the most distinctly shaped structures. In cluster 6, I found the CD21 and Podoplanin to be among the most upregulated, and the CD31 and Lyve1 to be among the most downregulated. Literature suggests that upregulation of Podoplanin highly corresponds to Follicular Dendritic cells and so does CD21 upregulation, and downregulation of CD31 corresponds to T Lymphocytes, and downregulation of Lyve1 corresponds to lymphatic cells. Thus I suspect this cluster to be lymph cells, as FDC’s are found in B cell follicles of secondary lymphoid organs. In cluster 3, I found the PanCK and SMActin to be among the most upregulated, and the Ki67 and CD45RO to be among the most downregulated. Literature suggests that upregulation of PanCK and SMActin corresponds to endothelial cells, and downregulation of Ki67 can be associated with quiescent endothelial cells. CD45RO downregulation highly suggests this cell is not a T cell or an immune cell. Thus I suspect cluster 3 to be endothelial cells. Due to my findings indicative of T Lymphocytes and endothelial cell types, I would say that this tissue structure most represents white pulp, likely found in the spleen.

Sources:

https://pmc.ncbi.nlm.nih.gov/articles/PMC2196095/ https://pubmed.ncbi.nlm.nih.gov/18838918/ https://pubmed.ncbi.nlm.nih.gov/1544907/#:~:text=activated%20T%20lymphocytes-,The%20cell%20adhesion%20molecule%20CD31%20is%20phosphorylated%20after%20cell%20activation,(8):5243%2D9. https://www.jbc.org/article/S0021-9258(20)71702-3/pdf#:~:text=3B).,either%20transcription%20or%20mRNA%20turnover. https://journals.aai.org/jimmunol/article/160/3/1078/31037/Follicular-Dendritic-Cell-FDC-Precursors-in https://www.nature.com/articles/s41586-021-03570-8 https://www.ahajournals.org/doi/10.1161/CIRCULATIONAHA.121.055417#:~:text=Endothelial%20to%20mesenchymal%20transition%20(EndoMT)%20is%20a%20process%20of%20cell,are%20descendants%20of%20endocardial%20ECs. https://www.spandidos-publications.com/10.3892/ol.2015.3286 https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2012.00201/full

Code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318

library(ggplot2)
library(patchwork)
file <- 'C:\\Hopkins School Stuff\\GenomicDataVis\\data\\codex_spleen_3.csv.gz'
data <- read.csv(file)
data[1:5,1:10]
pos <- data[, 2:3]
rownames(pos) <- data$cell_id
head(pos)
area <-data[,4]
gexp <- data[, 5:ncol(data)]
rownames(gexp) <- data$barcode
head(gexp)
head(pos)

norm <-log10(gexp/rowSums(gexp) * 1e6 +1)
norm[1:5,1:5]


ks = c(2,3,4,5,6,7,8,9,10,11, 12, 13, 14, 15, 16)
totw <-sapply(ks,function(k){
  print(k)
  com <-kmeans(norm,centers=k)
  return(com$tot.withinss)
})
plot(ks,totw)
ggplot(pos, aes(x=x, y=y)) + geom_point()
##looks like elbow is at k=8, pick k=8
com <-kmeans(norm, centers=8)
clusters <- com$cluster
clusters <- as.factor(clusters)
names(clusters) <-rownames(gexp)
head(clusters)

pcs <-prcomp(norm)
df <- data.frame(pcs$x, clusters)
head(pcs)
pc_plot<-ggplot(df, aes(x=PC1, y=PC2, col=clusters)) + geom_point()+ labs(title = "PC's")
pc_plot
## Now Try doing PCA first, then clustering on the first two PC's

pcs <- prcomp(norm)
plot(pcs$sdev)
elbow <- sapply(2:20, function(k) {
  out <- kmeans(pcs$x[,1:2], centers=k)
  out$tot.withinss
})
plot(2:20, elbow)

clusters <- as.factor(kmeans(pcs$x[,1:2], centers=8)$cluster)
df <- data.frame(pcs$x[,1:2], clusters)
PC_PLOT<-ggplot(df, aes(x = PC1, y=PC2, col=clusters)) + geom_point()
df <- data.frame(pos, clusters)
SPAT_PLOT<-ggplot(df, aes(x = x, y=y, col=clusters)) + geom_point()
PC_PLOT
SPAT_PLOT

names(clusters) <- rownames(norm)  
ct1 <- names(clusters)[which(clusters == interest)]
ctother <- names(clusters)[which(clusters != interest)]
results <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'greater')$p.value ##upregulated
})
names(results) <- colnames(norm)
post_pcaClust_results<-sort(results[results < 0.05/ncol(norm)])

names(results) <- colnames(norm)
sort(results[results < 0.05/ncol(colnames)])


## Now Try doing PCA first, then clustering on the first two PC's

pcs <- prcomp(norm)
plot(pcs$sdev)
elbow <- sapply(2:20, function(k) {
  out <- kmeans(pcs$x[,1:2], centers=k)
  out$tot.withinss
})
plot(2:20, elbow)

clusters <- as.factor(kmeans(pcs$x[,1:2], centers=8)$cluster)
df <- data.frame(pcs$x[,1:2], clusters)
PC_PLOT<-ggplot(df, aes(x = PC1, y=PC2, col=clusters)) + geom_point()
df <- data.frame(pos, clusters)
SPAT_PLOT<-ggplot(df, aes(x = x, y=y, col=clusters)) + geom_point()
PC_PLOT
SPAT_PLOT

names(clusters) <- rownames(norm)  
ct1 <- names(clusters)[which(clusters == interest)]
ctother <- names(clusters)[which(clusters != interest)]
results <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'greater')$p.value ##upregulated
})
names(results) <- colnames(norm)
post_pcaClust_results<-sort(results[results < 0.05/ncol(norm)])

names(results) <- colnames(norm)
sort(results[results < 0.05/ncol(colnames)])


## for cluster 6
interest<-6
names(clusters) <- rownames(norm)  
ct1 <- names(clusters)[which(clusters == interest)]
ctother <- names(clusters)[which(clusters != interest)]
results <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'greater')$p.value ##upregulated
})
names(results) <- colnames(norm)
clust6_upreg<-sort(results[results < 0.05/ncol(norm)])

results <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'less')$p.value ##downregulated
})
names(results) <- colnames(norm)
clust6_downreg<-sort(results[results < 0.05/ncol(norm)])

clust6_upreg
clust6_downreg


## for cluster 3

interest<-3
names(clusters) <- rownames(norm)  
ct1 <- names(clusters)[which(clusters == interest)]
ctother <- names(clusters)[which(clusters != interest)]
results <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'greater')$p.value ##upregulated
})
names(results) <- colnames(norm)
clust3_upreg<-sort(results[results < 0.05/ncol(norm)])

results <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'less')$p.value ##downregulated
})
names(results) <- colnames(norm)
clust3_downreg<-sort(results[results < 0.05/ncol(norm)])

clust3_upreg
clust3_downreg


library(ggplot2)
##clust 6 volcano plot
interest <- 6
names(clusters) <- rownames(norm)
ct1 <- names(clusters)[which(clusters == interest)]
ctother <- names(clusters)[which(clusters != interest)]
upreg_results_6 <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'greater')$p.value
})
names(upreg_results_6) <- colnames(norm)
clust6_upreg <- sort(upreg_results_6[upreg_results_6 < 0.05 / ncol(norm)])
downreg_results_6 <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'less')$p.value
})
names(downreg_results_6) <- colnames(norm)
clust6_downreg <- sort(downreg_results_6[downreg_results_6 < 0.05 / ncol(norm)])

mean_ct1_6 <- colMeans(norm[ct1, , drop = FALSE])
mean_ctother_6 <- colMeans(norm[ctother, , drop = FALSE])
log2FC_6 <- (mean_ct1_6 - mean_ctother_6) * log2(10)

p_values_6 <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'two.sided')$p.value
})
neg_log10_pval_6 <- -log10(p_values_6)

volcano_data_6 <- data.frame(
  Gene = colnames(norm),
  log2FC = log2FC_6,
  p_value = p_values_6,
  neg_log10_pval = neg_log10_pval_6
)

volcano_data_6$Significance <- "Not Significant"
volcano_data_6$Significance[volcano_data_6$p_value < 0.05 / ncol(norm) & volcano_data_6$log2FC > 0] <- "Upregulated"
volcano_data_6$Significance[volcano_data_6$p_value < 0.05 / ncol(norm) & volcano_data_6$log2FC < 0] <- "Downregulated"

volc_6<-ggplot(volcano_data_6, aes(x = log2FC, y = neg_log10_pval, color = Significance)) +
  geom_point(alpha = 0.8) +
  scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Not Significant" = "gray")) +
  theme_minimal() +
  xlab("Log2 Fold Change") +
  ylab("-log10(p-value)") +
  ggtitle("Fig.3") +
  theme(legend.position = "right") +
  geom_text(aes(label = ifelse(Significance %in% c("Upregulated", "Downregulated"), Gene, "")),
            vjust = -0.5, size = 3, color = "black", check_overlap = TRUE)


##clust 3 volcano plot
interest <- 3
names(clusters) <- rownames(norm)
ct1 <- names(clusters)[which(clusters == interest)]
ctother <- names(clusters)[which(clusters != interest)]

upreg_results_3 <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'greater')$p.value
})
names(upreg_results_3) <- colnames(norm)
clust3_upreg <- sort(upreg_results_3[upreg_results_3 < 0.05 / ncol(norm)])

downreg_results_3 <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'less')$p.value
})
names(downreg_results_3) <- colnames(norm)
clust3_downreg <- sort(downreg_results_3[downreg_results_3 < 0.05 / ncol(norm)])

mean_ct1_3 <- colMeans(norm[ct1, , drop = FALSE])
mean_ctother_3 <- colMeans(norm[ctother, , drop = FALSE])
log2FC_3 <- (mean_ct1_3 - mean_ctother_3) * log2(10)

p_values_3 <- sapply(colnames(norm), function(i) {
  wilcox.test(norm[ct1, i], norm[ctother, i], alternative = 'two.sided')$p.value
})
neg_log10_pval_3 <- -log10(p_values_3)

volcano_data_3 <- data.frame(
  Gene = colnames(norm),
  log2FC = log2FC_3,
  p_value = p_values_3,
  neg_log10_pval = neg_log10_pval_3
)

volcano_data_3$Significance <- "Not Significant"
volcano_data_3$Significance[volcano_data_3$p_value < 0.05 / ncol(norm) & volcano_data_3$log2FC > 0] <- "Upregulated"
volcano_data_3$Significance[volcano_data_3$p_value < 0.05 / ncol(norm) & volcano_data_3$log2FC < 0] <- "Downregulated"

volc_3<-ggplot(volcano_data_3, aes(x = log2FC, y = neg_log10_pval, color = Significance)) +
  geom_point(alpha = 0.8) +
  scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Not Significant" = "gray")) +
  theme_minimal() +
  xlab("Log2 Fold Change") +
  ylab("-log10(p-value)") +
  ggtitle("Fig.4") +
  theme(legend.position = "right") +
  geom_text(aes(label = ifelse(Significance %in% c("Upregulated", "Downregulated"), Gene, "")),
            vjust = -0.5, size = 3, color = "black", check_overlap = TRUE)


##verifying via spatial plotting 

## Cluster 6 - Upregulation

cd21_expr <- norm[, "CD21"]  

df_spatial <- data.frame(pos, clusters, CD21 = cd21_expr, Podoplanin = podoplanin_expr)

# Spatial plot for CD21 expression
clust_6_proof<-ggplot(df_spatial, aes(x = x, y = y, color = CD21)) +
  geom_point() +
  scale_color_gradient(low = "gray", high = "orange") +
  labs(title = "Fig.5", 
       x = "X Position", y = "Y Position", color = "CD21 Expression") +
  theme_minimal() +
  theme(legend.position = "right")


## Cluster 3

panck_expr <- norm[, "PanCK"]  
df_spatial <- data.frame(pos, clusters, PanCK = panck_expr)

# Spatial plot for PanCK expression across all clusters
ggplot(df_spatial, aes(x = x, y = y, color = PanCK)) +
  geom_point() +
  scale_color_gradient(low = "gray", high = "purple") +  # Upregulation color scheme
  labs(title = "Spatial Distribution of PanCK Expression (Up)", 
       x = "X Position", y = "Y Position", color = "PanCK Expression") +
  theme_minimal() +
  theme(legend.position = "right")


cd31_expr <- norm[, "CD31"]  
df_spatial <- data.frame(pos, clusters, CD31 = cd31_expr)

# Spatial plot for CD31 expression across all clusters 
ggplot(df_spatial, aes(x = x, y = y, color = CD31)) +
  geom_point() +
  scale_color_gradient(low = "purple", high = "gray") +  
  labs(title = "Spatial Distribution of CD31 Expression (Down)", 
       x = "X Position", y = "Y Position", color = "CD31 Expression") +
  theme_minimal() +
  theme(legend.position = "right")


# Spatial plot for cells with upregulation of PanCK and downregulation of CD31
panck_expr <- norm[, "PanCK"]
cd31_expr <- norm[, "CD31"]
panck_upregulated <- panck_expr > median(panck_expr)
cd31_downregulated <- cd31_expr < median(cd31_expr)
selected_cells <- panck_upregulated & cd31_downregulated
df_spatial_selected <- data.frame(pos[selected_cells, ], clusters[selected_cells], 
                                  PanCK = panck_expr[selected_cells], CD31 = cd31_expr[selected_cells])

clust3_proof<-ggplot(df_spatial_selected, aes(x = x, y = y, color = PanCK)) +
  geom_point() +
  scale_color_gradient(low = "gray", high = "purple") +  
  labs(title = "Fig.6", 
       x = "X Position", y = "Y Position", color = "PanCK Expression") +
  theme_minimal() +
  theme(legend.position = "right")




library(patchwork)
# Combine plots using patchwork
final_plot <- (PC_PLOT | SPAT_PLOT) / 
  (volc_6 | volc_3) / 
  (clust_6_proof | clust3_proof)

# Display the final combined plot
final_plot