EC1- tSNE on genes vs on PCs

Describe your figure briefly so we know what you are depicting (you no longer need to use precise data visualization terms as you have been doing).
The data was normalized by the count and then log transformed. This is an animation of 3 plots.
I first performed tSNE on gene expressions. The first plot encodes the expression of PTPRC as a gradient of color hue from gray to red in the tSNE space when tSNE was performed on genes. We see several distinct clusters, and some clusters from the bottom half of the tSNE plot seem to upregulate PTPRC. The choice of PTPRC would be explained later.
I plotted a scree plot of the standard deviations over the number of PCs and observed an elbow at about the 8th PC. This indicates that most variance is being captured by the first 8 PCs.Thus, I performed tSNE on the first 8 PCs. The second plot encodes the expression of PTPRC as a gradient of color hue from gray to red in the tSNE space when tSNE was performed on the first 8 PCs. We see that the clusters seem less distinct, which makes sense because information on some genes are lost. But still, we see PTPRC being upregulated in clusters from the bottom half of the tSNE plot.
From the scree plot, we see that the standard deviation is still pretty high for the 3rd PC. But for the sake of exploration, I performed tSNE on the first 2 PCs. The third plot encodes the expression of PTPRC as a gradient of color hue from gray to red in the tSNE space when tSNE was performed only on the first 2 PCs. Here, we lost the pattern of upregulation of PTPRC in specific clusters.
PTPRC was specifically chosen because it is the gene with the highest loading (highest contribution) in the 3rd PC, and so I expect the expression of PTPRC in tSNE space to be different when tSNE is performed only on the first 2 PCs. Variation in PC3 is highly influenced by PTPRC expression. If we include PC3, where PTPRC contributes the most, the tSNE structure will reflect PTPRC’s expression pattern more strongly. If we exclude PC3 (only use the first two PCs), the tSNE representation will be missing the key variation from PTPRC, causing a different cluster structure or spatial distribution.
From this exploration, we see that we retain the most information when we perform non-linear dimensionality reduction on genes directly. However, to save time from running through the large number of genes, we might want to perform linear dimensionality reduction on the genes first before performing non-linear dimensionality reduction on those data (the PCs). In this case, it is important that we use the number of PCs that capture most of the variance of the genes such that we don’t lose important information.
Code (paste your code in between the ``` symbols)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
file <- "data/codex_spleen_3.csv.gz"
data <- read.csv(file)
data[1:5,1:10]
pos <- data[, 1:4]
rownames(pos) <- data$barcode
head(pos)
exp <- data[, 4:ncol(data)]
rownames(exp) <- data$barcode
exp[1:5,1:5]
dim(exp)
#proteomics- 28
norm<- exp
norm<- log10(exp/exp$area +1)
norm[1:5,1:5]
## try many ks
ks = seq.int(1, 40, 2)
totw <- sapply(ks, function(k) {
print(k)
set.seed(1)
com <- kmeans(norm, centers=k)
return(com$tot.withinss)
})
## find optimal k from elbow plot
library(ggplot2)
df0<-data.frame(ks,totw)
p0<-ggplot(df0, aes(x=ks, y=totw)) + geom_point(size=3)+
labs(
title = "Total Withinness for different k's",
x = "Number of k",
y = "Total Withinness"
) +
theme_bw()
p0
## elbow at roughly k=20
set.seed(1)
com<- kmeans(norm,centers=20)
clusters<-as.factor(com$cluster)
clusters
names(clusters) <- rownames(norm)
head(clusters)
## Plot 1: A panel visualizing all clusters in reduced dimensional space (tSNE)
emb <- Rtsne::Rtsne(norm)
head(emb$Y)
df1 <- data.frame(emb$Y, clusters=clusters)
p1<-ggplot(df1, aes(x=X1, y=X2, col=clusters)) + geom_point(size=1)+
labs(
title = "Clusters in tSNE Space",
color = "Cluster",
x = "tSNE1",
y = "tSNE2"
) +
theme_bw()
p1
##Plot 2: A panel visualizing all clusters in physical space
df2 <- data.frame(aligned_x = data$x, aligned_y = data$y, emb$Y, clusters)
p2<- ggplot(df2) + geom_point(aes(x = aligned_x, y = aligned_y,
color= clusters), size=1) +
labs(
title = "Clusters in Physical Space",
color = "Cluster",
x = "Aligned X",
y = "Aligned Y"
) +
theme_bw()
p2
## characterizing cluster 9
interest <- 9
cellsOfInterest<-names(clusters)[clusters==interest]
OtherCells<-names(clusters)[clusters!=interest]
ClusterOfInterest<- ifelse(clusters=='9','Cluster Of Interest','Others')
## Plot 3: A panel visualizing your one cluster of interest (cluster 9) in reduced dimensional space (tSNE)
df3 <- data.frame(emb$Y, clusters=clusters,ClusterOfInterest)
p3<-ggplot(df3, aes(x=X1, y=X2, col=ClusterOfInterest)) + geom_point(size=1) +
scale_color_manual(values = c("red","gray")) +
labs(
title = "Cluster 9 vs Others in tSNE Space",
color = "Cluster",
x = "tSNE1",
y = "tSNE2"
) +
theme_bw()
p3
##Plot 4: A panel visualizing your one cluster of interest in physical space
df4 <- data.frame(aligned_x = data$x, aligned_y = data$y, emb$Y, clusters,ClusterOfInterest)
p4<- ggplot(df4) + geom_point(aes(x = aligned_x, y = aligned_y,
color= ClusterOfInterest), size=1) +
scale_color_manual(values = c("red","gray")) +
labs(
title = "Cluster 9 vs Others in Physical Space",
color = "Cluster",
x = "Aligned X",
y = "Aligned Y"
) +
theme_bw()
p4
?wilcox.test
# do wilcox for DE genes
interest<-9
pv <- sapply(colnames(norm), function(i) {
print(i) ## print out gene name
wilcox.test(norm[clusters == interest, i], norm[clusters != interest, i],alternative='greater')$p.val
})
head(sort(pv)) # CD21 HLA.DR CD20 CD35 CD44 CD1c
logfc <- sapply(colnames(norm), function(i) {
print(i) ## print out gene name
log2(mean(norm[clusters == interest, i])/mean(norm[clusters != interest, i]))
})
## Plot 5: volcano plot
df <- data.frame(pv=-(log10(pv+1e-100)), logfc,genes=names(pv))
# add gene names
df$genes <- rownames(df)
# add labeling for 10 fold change
df$delabel <- ifelse(df$logfc > 1, df$genes, NA)
df$delabel <- ifelse(df$logfc < -1.5, df$genes, df$delabel)
# add if DE
df$diffexpressed <- ifelse(df$logfc > 1, "Upregulated", "Not Significant") #2 or 1.5
df$diffexpressed <- ifelse(df$logfc < -1.5, "Downregulated", df$diffexpressed)
library(ggrepel)
# plot
p5<-ggplot(df, aes(x = logfc, y = pv, label = delabel, color = diffexpressed)) +
geom_point(size= 0.75) +
geom_vline(xintercept = c(-1.5, 1), col = "gray", linetype = 'dashed') +
geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') +
scale_color_manual(values = c("#00AFBB", "grey", "red"),
labels = c("Downregulated", "Not significant", "Upregulated")) +
# theme
theme_classic() +
# labels
labs(color = 'Gene Significance',
x = expression("log"[2]*"FC"),
y = expression("-log"[10]*"p-value"),
title = "Gene differences for Cluster 9 vs Others") +
geom_text_repel(aes(label = delabel), na.rm = TRUE,
max.overlaps = Inf, box.padding = 0.25, point.padding = 0.25, min.segment.length = 0, size = 4, color = "black") +
scale_x_continuous(breaks = seq(-5, 5, 1)) +
guides(size = "none",color = guide_legend(override.aes = list(size=5)))
p5
## Plot 9: A panel visualizing one of these genes (CD21) in reduced dimensional space (tSNE)
df9 <- data.frame(emb$Y, clusters, gene=norm[,'CD21'])
p9<-ggplot(df9, aes(x=X1, y=X2, col=gene)) + geom_point(size=0.5) +
scale_color_gradient(low = 'lightgrey', high='red') +
labs(
title = "Expression of CD21 in tSNE Space",
color = "CD21",
x = "tSNE1",
y = "tSNE2"
) +
theme_bw()
p9
p4
## Plot 10: A panel visualizing one of these genes in space
df10 <- data.frame(aligned_x = data$x, aligned_y = data$y, emb$Y, clusters, gene=norm[,'CD21'],ClusterOfInterest)
p10<-ggplot(df10) + geom_point(aes(x = aligned_x, y = aligned_y,
color= gene), size=1.5) +
scale_color_gradient(low = 'lightgrey', high='red') +
labs(
title = "Expression of CD21 in Physical Space",
color = "CD21",
x = "Aligned X",
y = "Aligned Y"
) +
theme_bw()
p10
## characterizing cluster 7
interest <- 7
cellsOfInterest<-names(clusters)[clusters==interest]
OtherCells<-names(clusters)[clusters!=interest]
ClusterOfInterest<- ifelse(clusters=='7','Cluster Of Interest','Others')
## Plot 6: A panel visualizing your one cluster of interest (cluster 7) in reduced dimensional space (tSNE)
df6 <- data.frame(emb$Y, clusters=clusters,ClusterOfInterest)
p6<-ggplot(df6, aes(x=X1, y=X2, col=ClusterOfInterest)) + geom_point(size=1) +
scale_color_manual(values = c("red","gray")) +
labs(
title = "Cluster 7 vs Others in tSNE Space",
color = "Cluster",
x = "tSNE1",
y = "tSNE2"
) +
theme_bw()
p6
##Plot 7: A panel visualizing your one cluster of interest in physical space
df7 <- data.frame(aligned_x = data$x, aligned_y = data$y, emb$Y, clusters,ClusterOfInterest)
p7<- ggplot(df7) + geom_point(aes(x = aligned_x, y = aligned_y,
color= ClusterOfInterest), size=1) +
scale_color_manual(values = c("red","gray")) +
labs(
title = "Cluster 7 vs Others in Physical Space",
color = "Cluster",
x = "Aligned X",
y = "Aligned Y"
) +
theme_bw()
p7
?wilcox.test
# do wilcox for DE genes
interest<-7
pv <- sapply(colnames(norm), function(i) {
print(i) ## print out gene name
wilcox.test(norm[clusters == interest, i], norm[clusters != interest, i],alternative='greater')$p.val
})
head(sort(pv)) # CD8 CD3e CD45 CD45RO CD163 CD44 CD1c
logfc <- sapply(colnames(norm), function(i) {
print(i) ## print out gene name
log2(mean(norm[clusters == interest, i])/mean(norm[clusters != interest, i]))
})
df
## Plot 8: volcano plot for cluster 7
df8 <- data.frame(pv=-(log10(pv+1e-100)), logfc,genes=names(pv))
# add gene names
df8$genes <- rownames(df8)
# add labeling for 10 fold change
df8$delabel <- ifelse(df8$logfc > 1, df8$genes, NA)
df8$delabel <- ifelse(df8$logfc < -0.7, df8$genes, df8$delabel)
# add if DE
df8$diffexpressed <- ifelse(df8$logfc > 1, "Upregulated", "Not Significant") #2 or 1.5
df8$diffexpressed <- ifelse(df8$logfc < -0.7, "Downregulated", df8$diffexpressed)
library(ggrepel)
# plot
p8<-ggplot(df8, aes(x = logfc, y = pv, label = delabel, color = diffexpressed)) +
geom_point(size= 0.75) +
geom_vline(xintercept = c(-0.7, 1), col = "gray", linetype = 'dashed') +
geom_hline(yintercept = -log10(0.05), col = "gray", linetype = 'dashed') +
scale_color_manual(values = c("#00AFBB", "grey", "red"),
labels = c("Downregulated", "Not significant", "Upregulated")) +
# theme
theme_classic() +
# labels
labs(color = 'Gene Significance',
x = expression("log"[2]*"FC"),
y = expression("-log"[10]*"p-value"),
title = "Gene differences for Cluster 7 vs Others") +
geom_text_repel(aes(label = delabel), na.rm = TRUE,
max.overlaps = Inf, box.padding = 0.25, point.padding = 0.25, min.segment.length = 0, size = 4, color = "black") +
scale_x_continuous(breaks = seq(-5, 5, 1)) +
guides(size = "none",color = guide_legend(override.aes = list(size=5)))
p8
## Plot 11: A panel visualizing one of these genes (CD8) in reduced dimensional space (tSNE)
df11 <- data.frame(emb$Y, clusters, gene=norm[,'CD8'])
p11<-ggplot(df11, aes(x=X1, y=X2, col=gene)) + geom_point(size=0.5) +
scale_color_gradient(low = 'lightgrey', high='red') +
labs(
title = "Expression of CD8 in tSNE Space",
color = "CD8",
x = "tSNE1",
y = "tSNE2"
) +
theme_bw()
p11
p6
## Plot 12: A panel visualizing one of these genes in space
df12 <- data.frame(aligned_x = data$x, aligned_y = data$y, emb$Y, clusters, gene=norm[,'CD8'],ClusterOfInterest)
p12<-ggplot(df12) + geom_point(aes(x = aligned_x, y = aligned_y,
color= gene), size=1.5) +
scale_color_gradient(low = 'lightgrey', high='red') +
labs(
title = "Expression of CD8 in Physical Space",
color = "CD8",
x = "Aligned X",
y = "Aligned Y"
) +
theme_bw()
p12
p7
install.packages("gridExtra") # Install if not already installed
library(gridExtra) # Load the package
library(ggplot2)
# Set up the PNG output
png("white_pulp_analysis.png", width = 1200, height = 2000, res = 150)
## Combine plots together
lay <- rbind(c(1, 1),
c(2, 3),
c(4, 5),
c(6, 6)
)
grid1<-grid.arrange(p0,
p1,p2,
p3,p4,
p5,
layout_matrix = lay,
top = "Analyzing for White Pulp Within Spleen")
## Combine plots together
lay <- rbind(
c(7, 8),
c(9, 10),
c(11, 11),
c(12, 13)
)
grid2<-grid.arrange(
p9,p10,
p6,p7,
p8,
p11,p12,
layout_matrix = lay,
top = "Analyzing for White Pulp Within Spleen")
dev.off()
library(patchwork)
png("white_pulp_analysis.png", width = 1200, height = 2000, res = 150)
p0
dev.off
p1+p2
p3+p4
p5
p9+p10
p6+p7
p8
p11+p12
## Combine plots together
lay <- rbind(c(1,1),
c(2, 3),
c(4, 5)
)
install.packages("gridExtra") # Install if not already installed
library(gridExtra) # Load the package
library(ggplot2)
grid.arrange(p0,
p1,p2,
p3,p4,
layout_matrix = lay,
top = "Analyzing for White Pulp Within Spleen")
Code for volcano plot referenced https://jef.works/genomic-data-visualization-2024/blog/2024/02/14/challin1/
The final png was merged on https://products.groupdocs.app/merger/png#folderName=dd1be75e-700e-4c83-b0c3-8787377d0c58 because the the viewport on R is too small for my figure.