Multi-Panel Data Visualization of Epithelial Cell Cluster in Eevee Dataset

This figure presents an analysis of cellular clusters within the Eevee dataset, focusing on the identification and characterization of a biologically relevant cluster using k-means clustering, dimensionality reduction techniques (PCA and t-SNE), differential gene expression analysis, and statistical analysis through volcano plots. The goal is to determine whether the identified cluster corresponds to the epithelial-like population previously characterized in the Pikachu dataset of HW3, despite differences in the dataset resolution and clustering structure. Principal Component Analysis (PCA) was applied to reduce the dimensionality of the gene expression dataset, improving clustering performance by filtering noise. K-means clustering was then performed on the top principal components to obtain the optimal number of clusters. The top-left PCA plot and top-middle t-SNE plot visualize these clusters, with the cluster of interest highlighted in purple, distinguishing it from the remaining cells (highlighted in green). The top-right physical space plot further confirms that this cluster exhibits a distinct spatial organization within the tissue sample. Furthermore, to identify the optimal PCs needed to account for meaningful variance of the Eevee dataset, visual comparisons were completed to change the number of optimal PCs from 20 in the Pikachu code to 7. In addition to this, the elbow method was additionally utilized to identify the optimal number of clusters (k) for the data. Through identifying the elbow point, the optimal number of clusters was changed from the previous value of 5 to 6. This allowed for the optimal number of clusters for the Eevee dataset. Finally, the presence of the top 2 highly expressed genes from HW3, KRT5 and KRT17, was tested within each cluster of the Eevee dataset to determine that the new cluster of interest would be Cluster 1 (rather than Cluster 2) in order to identify the adequate cell type cluster of epithelial cells with keratinization characteristics.
To assess the biological relevance of this cluster, differentially expressed genes were identified using the Wilcoxon rank-sum test, followed by visualization in the volcano plot (bottom panel). This analysis revealed a set of significantly upregulated genes—KRT17, KRT6B, LAMC2, LAMB3, ACTG2, and KLK5—that distinguish this cluster. Compared to the Pikachu dataset, where epithelial markers such as KRT5, KRT15, KRT16, ELF3, and CEACAM8 were dominant, the Eevee dataset highlights a shift in gene expression patterns, reflecting potential differences in dataset resolution and biological context. However, through further analysis with the Human Protein Atlas, each of these upregulated genes has well-documented associations with epithelial differentiation, extracellular matrix remodeling, and, in some cases, tumorigenesis. KRT17 (Keratin 17), KRT6B (Keratin 6B), LAMC2 (Laminin Subunit Gamma-2), LAMB3 (Laminin Subunit Beta-3), KLK5 (Kallikrein-Related Peptidase 5), and COL17A1 (Collagen type XVII alpha 1 chain) are all shown to exhibit the same squamous epithelial cell line expression that is specific to keratinization, while ACTG2 is found to be expressed in connective tissue cell lines with ECM organization. With this result, it is therefore clear this code identifies the same epithelial cell cluster with tumorigenic characteristics that was previously identified within the Pikachu dataset.
To further confirm the biological identity of the cluster, KRT17 expression levels were mapped in PCA space (middle-left), t-SNE space (middle-center), and physical space (middle-right). The strong enrichment of KRT17 within the cluster supports its epithelial-like properties and potential stress-induced transformation. As well as this, the observed upregulation of basement membrane-associated genes (LAMC2, LAMB3), cytoskeletal regulators (ACTG2), and proteases (KLK5) further indicates a highly dynamic and possibly invasive epithelial population. By adapting the analytical workflow from HW3, I successfully identified an epithelial-like cluster in the Eevee dataset, corresponding to the previously analyzed cluster in the Pikachu dataset. The observed shift from KRT5/KRT15-based markers in Pikachu to KRT17/KRT6B/LAMC2 in Eevee suggests dataset-specific variability with the dramatic increase in genes represented while maintaining epithelial characteristics. The integration of clustering, spatial mapping, and differential gene expression analysis strongly supports the presence of epithelial-enriched features within this cluster. Furthermore, the upregulated genes’ known involvement in tumorigenesis, EMT, and ECM remodeling (validated through HPA and literature sources) provides strong biological justification for the classification of this cluster.
Sources:
- Human Protein Atlas: https://www.proteinatlas.org/
- https://www.ncbi.nlm.nih.gov/gene/3872
- https://www.ncbi.nlm.nih.gov/gene/3854
- https://www.ncbi.nlm.nih.gov/gene/3914
- https://www.ncbi.nlm.nih.gov/gene/3918
- https://www.ncbi.nlm.nih.gov/gene/25818
- https://www.ncbi.nlm.nih.gov/gene/72
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
# Necessary libraries loaded
library(ggplot2)
library(Rtsne)
library(patchwork)
library(dplyr)
library(cluster)
library(factoextra)
library(ggrepel)
# Loads Eevee dataset
file <- '~/Desktop/genomic-data-visualization-2025/data/eevee.csv.gz'
data <- read.csv(file)
# Extracts spatial coordinates and gene expression data from aligned_x and aligned_y columns
pos <- data[, c('aligned_x', 'aligned_y')]
rownames(pos) <- data$cell_id
gexpsub <- data[, 7:ncol(data)]
rownames(gexpsub) <- data$cell_id
# Log-transforms gene expression data and normalizes gexp
normgexpsub <- log10(gexpsub/rowSums(gexpsub) * mean(rowSums(gexpsub)) + 1)
# Performs PCA on gene expression data
pcs <- prcomp(normgexpsub)
plot(pcs$sdev[1:40])
# Finds optimal PCs for kMeans
plot(1:20, pcs$sdev[1:20], type = "l")
plot(1:10, pcs$sdev[1:10], type = "l")
# 7 PCs accounts for meaningful variance of the eevee dataset
pc_opt <- 7
# Perform t-SNE on PCA-reduced gene expression data
set.seed(5) # Ensures reproducibility
# Determine optimal number of k's (centroids) for kMeans using elbow method
elbow <- fviz_nbclust(normgexpsub, kmeans, method = "wss") +
geom_vline(xintercept = 5, linetype = 2) +
labs(subtitle = "Elbow Method")
elbow
# 6 cluster k's accounts for meaningful variance of the eevee dataset
cluster <- 6
# Perform k-means clustering with k = 6
tsne_emb <- Rtsne(pcs$x[,1:pc_opt])$Y
com <- kmeans(tsne_emb, cluster)
# Creates a data frame for t-SNE visualization
df_tsne <- data.frame(tsne_emb)
colnames(df_tsne) <- c("tSNE1", "tSNE2")
df_tsne$clusters <- as.factor(com$cluster)
# Assigns clusters to data frames
df_clusters <- data.frame(aligned_x = pos$aligned_x, aligned_y = pos$aligned_y,
tSNE1 = tsne_emb[, 1], tSNE2 = tsne_emb[, 2], kmeans = as.factor(com$cluster))
df_pca <- data.frame(pcs = pcs$x, kmeans = as.factor(com$cluster))
df_pos <- data.frame(pos, clusters = as.factor(com$cluster))
colnames(df_pca)
head(df_pca)
str(df_pca)
# Visualizes the clusters in PC1 vs. PC2 space
pc_cluster <- ggplot(data = df_pca, aes(x = pcs.PC1, y = pcs.PC2, col = cluster)) +
geom_point(size = 1) +
theme_minimal() +
labs(title = "Cell Clusters in PC1 v.s. PC2 Space") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color = guide_legend(override.aes = list(size = 2.5)))
pc_cluster
# Visualize the clusters in physical space
physical_cluster <- ggplot(data = df_clusters, aes(x = aligned_x, y = aligned_y, col = cluster)) +
geom_point(size = 1) +
theme_minimal() +
labs(title = "Cell Clusters in Physical Space", x = "Aligned X Position", y= "Aligned Y Position") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color = guide_legend(override.aes = list(size = 2.5)))
physical_cluster
## The top 5 genes identified prior are in prior_gene_list
prior_gene_list <- list("KRT5", "KRT15", "KRT16", "KRT23", "ELF3", "CEACAM8", "PIGR")
for (gene in prior_gene_list) {
# Check if the gene exists as a column name in df_cluster
if (gene %in% colnames(df_clusters)) {
print(gene)
}
}
# Defines a specific gene to visualize (highest expressed gene from hw 3)
selected_gene <- "KRT5"
# Assigns selected gene expression values as numeric value
df_pca$gene <- as.numeric(normgexpsub[, selected_gene])
# Visualizes where KRT5 expression is high
krt5_PCA <- ggplot(data = df_pca, aes(x = pcs.PC1, y = pcs.PC2, col = gene)) +
geom_point(size = 1) +
theme_bw() +
scale_color_gradient(low = 'lightgrey', high = 'blue') +
labs(x = "Principal Component 1", y = "Principal Component 2", color = "KRT5", title = "KRT5 Expression in PCA Space") +
theme(plot.title = element_text(hjust = 0.5))
krt5_PCA
# Defines a specific gene to visualize (second highest expressed gene from hw 3)
selected_gene <- "KRT15"
# Assigns selected gene expression values as numeric value
df_pca$gene <- as.numeric(normgexpsub[, selected_gene])
# Visualizes where KRT15 expression is high
krt15_PCA <- ggplot(data = df_pca, aes(x = pcs.PC1, y = pcs.PC2, col = gene)) +
geom_point(size = 1) +
theme_bw() +
scale_color_gradient(low = 'lightgrey', high = 'blue') +
labs(x = "Aligned X Position", y = "Algned Y Position", color = "KRT15", title = "DSP Expression in PCA Space") +
theme(plot.title = element_text(hjust = 0.5))
krt15_PCA
# Defines cluster of interest
cluster_interest <- 1
# Visualizes cluster of interest, Cluster 1, in PCA space
p1 <- ggplot(df_pca) + geom_point(aes(x = pcs.PC1, y = pcs.PC2,
col = ifelse(kmeans == cluster_interest, "Cluster of Interest", "All Other Cells")),
size = 0.01) +
theme_minimal() +
labs(title = "PCA Clustering of\n Cluster of Interest",
x = "Principal Component 1", y = "Principal Component 2", color = "Clustering on tSNE") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 8),
axis.text = element_text(size = 6)
) + guides(color = guide_legend(override.aes = list(size = 2))) +
scale_color_manual(values = c("lightgreen", "purple"))
p1 + krt5_PCA + krt15_PCA # Looks like cluster 1 should be cluster of interest
# Visualizes cluster of interest, Cluster 1, in t-SNE space
p2 <- ggplot(df_clusters) +
geom_point(aes(x = tSNE1, y = tSNE2,
col = ifelse(kmeans == cluster_interest, "Cluster of Interest", "All Other Cells")),
size = 0.01) +
theme_minimal() +
labs(title = "t-SNE Embedded Space\n of Cluster of Interest",
x = "tSNE1", y = "tSNE2", color = "Clustering on tSNE") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 8),
axis.text = element_text(size = 6)
) + guides(color = guide_legend(override.aes = list(size = 2))) +
scale_color_manual(values = c("lightgreen", "purple"))
# Visualizes cluster of interest, Cluster 1, in physical space
p3 <- ggplot(df_clusters) +
geom_point(aes(x = aligned_x, y = aligned_y,
col = ifelse(kmeans == cluster_interest, "Cluster of Interest", "All Other Cells")),
size = 0.01) +
theme_minimal() +
labs(title = "Physical Spatial Distribution\n of Cluster of Interest",
x = "Physical X-position", y = "Physical Y-Position", color = "Clustering on tSNE") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 8),
axis.text = element_text(size = 6)
) + guides(color = guide_legend(override.aes = list(size = 2))) +
scale_color_manual(values = c("lightgreen", "purple"))
# Visualizes differentially expressed genes for cluster of interest, Cluster 1
com_categories = as.factor(com$cluster)
p_value = sapply(colnames(normgexpsub), function(i){
print(i)
wilcox.test(normgexpsub[com_categories == cluster_interest, i], normgexpsub[com_categories!= cluster_interest, i])$p.val
})
logfc = sapply(colnames(normgexpsub), function(i){
print(i)
log2(mean(normgexpsub[com_categories == cluster_interest, i])/mean(normgexpsub[com_categories != cluster_interest, i]))
})
# Creates a volcano plot dataframe
df_volcano <- data.frame(
p_value = -log10(p_value + 1e-100),
logfc,
genes = colnames(normgexpsub)
)
# Removes NA values from df_volcano to reduce warning message error
df_volcano <- df_volcano[complete.cases(df_volcano), ]
# Adjusts threshold for gene labeling to avoid excessive overlap of labels
df_volcano$gene_label <- ifelse(
df_volcano$p_value > 10 & abs(df_volcano$logfc) > 1.5,
as.character(df_volcano$genes),
NA
)
#Assigns number of labels as 7 to allow for proper labeling without excessive overlap
num_labels_down <- 7
num_labels_up <- 7
top_down <- df_volcano %>% filter(logfc < -1) %>% top_n(num_labels_down, wt = p_value)
top_up <- df_volcano %>% filter(logfc > 1) %>% top_n(num_labels_up, wt = p_value)
df_volcano$gene_label <- ifelse(
df_volcano$genes %in% c(top_down$genes, top_up$genes),
as.character(df_volcano$genes),
NA
)
# Generates volcano plot
volcano_plot <- ggplot(df_volcano) +
geom_point(aes(x = logfc, y = p_value, col = ifelse(logfc > 0, 'Upregulated', 'Downregulated'))) +
geom_label_repel(aes(x = logfc, y = p_value, label = gene_label), box.padding = 0.5,
point.padding = 0.5, segment.color = 'grey50', max.overlaps = 30, min.segment.length = 0.5,
force = 2, size = 3) +
ylim(0, max(df_volcano$p_value) + 5) + geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1.5, 1.5), linetype = "dashed") +
labs(col = "Gene Expression\nIn Cluster of Interest",
title = "Volcano Plot Representation of Gene Expression in Cluster of Interest",
x = "Log(Fold Change)", y = "-log10(P-value)") +
theme(plot.title = element_text(face = "bold")) +
scale_color_manual(values = c("purple", "lightgreen"))
# Displays the volcano plot to identify clusters of downregulated and upregulated genes
volcano_plot
# Defines a specific gene to visualize
selected_gene <- "KRT17"
# Assigns selected gene expression values as numeric value
df_pca$gene <- as.numeric(normgexpsub[, selected_gene])
df_tsne$gene <- as.numeric(normgexpsub[, selected_gene])
df_pos$gene <- as.numeric(normgexpsub[, selected_gene])
# Visualizes selected gene in PCA space
p4 <- ggplot(df_pca, aes(x = pcs.PC1, y = pcs.PC2, col = gene)) +
geom_point(size = 0.01) +
scale_color_viridis_c() +
theme_minimal() +
labs(title = "PCA Clustering of KRT17 Gene",
x = "Principal Component 1", y = "Principal Component 2", color = "KRT17 Gene Expression") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 8),
axis.text = element_text(size = 6)
)
# Visualizes selected gene in t-SNE space
df_tsne$gene <- normgexpsub[, selected_gene]
p5 <- ggplot(df_tsne, aes(x = tSNE1, y = tSNE2, col = gene)) +
geom_point(size = 0.01) +
scale_color_viridis_c() +
theme_minimal() +
labs(title = "t-SNE Embedded Space\n of KRT17 Gene",
x = "tSNE1", y = "tSNE2", color = "KRT17 Gene Expression") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 8),
axis.text = element_text(size = 6)
)
# Visualizes selected gene in physical space
df_pos$gene <- normgexpsub[, selected_gene]
p6 <- ggplot(df_pos, aes(x = aligned_x, y = aligned_y, col = gene)) +
geom_point(size = 0.01) +
scale_color_viridis_c() +
theme_minimal() +
labs(title = "Physical Spatial Distribution\n of KRT17 Gene",
x = "Physical X-Position", y = "Physical Y-Position", color = "KRT17 Gene Expression") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 8),
axis.text = element_text(size = 6)
)
# Combines all plots together for visualization
(p1 + p2 + p3) / (p4 + p5 + p6) / volcano_plot +
plot_layout(heights = c(1, 1, 1.5), guides = "collect")
# Sources:
# https://www.datacamp.com/doc/r/cluster
# https://rpkgs.datanovia.com/factoextra/
# https://ggrepel.slowkow.com/
# code-lesson-5.R
# code-lesson-6.R
# code-lesson-7.R
# code-lesson-8.R
# https://www.statology.org/set-seed-in-r/
# https://www.appsilon.com/post/r-tsne
# https://www.datacamp.com/tutorial/pca-analysis-r
# https://www.datacamp.com/tutorial/k-means-clustering-r
# https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame
# https://stackoverflow.com/questions/21271449/how-to-apply-the-wilcox-test-to-a-whole-dataframe-in-r
# https://biostatsquid.com/volcano-plots-r-tutorial/
# https://www.geeksforgeeks.org/how-to-create-and-visualise-volcano-plot-in-r/
# https://sjmgarnier.github.io/viridis/reference/scale_viridis.html
# https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/
# https://www.rdocumentation.org/packages/patchwork/versions/1.3.0/topics/plot_layout