Identifying a cluster of Proximal Tubule Epithelial Cells

I'm Isabella, a Senior BME.

Identifying a cluster of Proximal Tubule Epithelial Cells

Description

To identify the same group of cells in the Visium dataset, I adapted the same analysis I used previously for the Xenium dataset. Like before, I normalized the raw counts and did PCA for dimensionality reduction. I made a scree plot and decided to keep using the first 10 PCs, since the elbow in the scree plot was at a similar place as the Xenium dataset. I also made an elbow plot of withinness for kmeans. The curve also showed a flattening around k = 9, so I used 9 clusters for the analysis, and the number did not change between the two datasets.

After clustering, I visualized each cluster in tissue space to try and find the same cluster as before. Cluster 9 had a spatial pattern very similar to my previous cluster. I performed differential expression analysis comparing cluster 9 to all other spots in the same way as before, using a Wilcoxon rank-sum test (alternative = “greater”) to identify the genes upregulated in the cluster. I kept the genes with p < 0.05 and ranked them by specificity. The top differentially expressed genes were Serpina1d, Slc22a19, Scd1, and Cyp7b1. Both Slc22a19 and Cyp7b1 were also highly specific markers in the Xenium dataset. Slc22a19 has been experimentally localized to proximal tubule cells, and Cyp7b1 is expressed in kidney epithelial cells, including the tubules (https://www.proteinatlas.org/ENSG00000172817-CYP7B1/tissue/kidney and https://www.sciencedirect.com/science/article/abs/pii/S0022356524322852)and . The presence of these same markers supports the idea that cluster 9 represents the same cell type that I identified before. The other two differentially expressed genes (Serpina1d and Scd1) are specifically expressed in the proximal tubule epithelial cells as well. (https://www.bgee.org/gene/ENSMUSG00000071177 and https://pubmed.ncbi.nlm.nih.gov/25409434/)

The combination of spatial location, similar gene markers seen in the previous dataset, and other markers that are found in the proximal tubule epithelial cells, shows that cluster 9 in the Visium dataset corresponds to the same group of proximal tubule epithelial cells that I identified before in the Xenium dataset.

Overall, there were no changes that I needed to make to my code to identify the same dataset, other than selecting a different cluster of interest. When I evaluated whether changes to the number of principal components or clusters were necessary, both the screen and elbow plot showed similar optimal values, so I continued 10 PCs and k = 9. Because of that, I didn’t make any major changes to the clustering parameters.

Code (paste your code in between the ``` symbols)

#NOTE: view the final plot in full screen mode or else it looks squished

library(ggplot2)
library(patchwork)

data <- read.csv("~/Desktop/Visium-IRI-ShamR_matrix.csv.gz")


pos <- data[,c('x', 'y')] 
rownames (pos) <- data[, 1] 
gexp <- data[, 4: ncol(data)]
rownames (gexp) <- data[, 1]

totgexp = rowSums(gexp)
mat <-log10(gexp/totgexp *1e6 + 1)

vg <- apply(mat, 2, var)
vargenes <- names(sort(vg, decreasing=TRUE))

matsub <- mat[, vargenes]

#PCA (keeping top 10 PCS)
pcs <- prcomp(matsub)
selected_PCs <- pcs$x[, 1:10]

#kmeans (using 9 centers)
set.seed(1)
km <- kmeans(selected_PCs, centers = 9)
clusters <- factor(km$cluster)

pc_df <- data.frame(PC1 = pcs$x[,1], PC2 = pcs$x[,2], cluster = clusters)
space_df <- data.frame(x = pos$x, y = pos$y, cluster = clusters)


#highlight one cluster
space_df$highlight <- ifelse(space_df$cluster == "9", "Cluster of interest", "Other clusters")
pc_df$highlight <- ifelse(pc_df$cluster == "9", "Cluster of interest", "Other clusters")

#Plot with cluster 9 highlighted in PCA space
pca_cluster <- ggplot(pc_df, aes(x=PC1, y=PC2, color = highlight)) +
  geom_point(size = 1, alpha = 0.4) +
  scale_color_manual(name = "Cluster", values = c("Other clusters" = "grey80", "Cluster of interest" = "blue"),
                     guide = guide_legend(override.aes = list(size = 1, alpha = 0.8))) +
  theme_classic() +
  coord_fixed()+
  ggtitle("Cluster of interest highlighted in PCA space")

#Plot with cluter 9 highlighted in spatial space
space_cluster <- ggplot(space_df, aes(x, y, color = highlight)) +
  geom_point(size = 1, alpha = 0.4) +
  scale_color_manual(name = "Cluster", values = c("Other clusters" = "grey80", "Cluster of interest" = "blue"),
                     guide = guide_legend(override.aes = list(size = 1, alpha = 0.8))) +
  theme_classic() +
  coord_fixed()+
  ggtitle("Cluster 9 highlighted in physical space")


#Differentially expressed genes
#Define in vs out
in_cells  <- names(clusters)[clusters == "9"]
out_cells <- names(clusters)[clusters != "9"]

mat_in  <- mat[in_cells, , drop = FALSE]
mat_out <- mat[out_cells, , drop = FALSE]

#Percent detection
pct_in  <- colMeans(mat_in  > 0)
pct_out <- colMeans(mat_out > 0)

#Wilcoxon test per gene
genes <- colnames(mat)
pval <- sapply(genes, function(g) {
  wilcox.test(mat_in[, g], mat_out[, g], alternative='greater')$p.value
})

#make table
de <- data.frame(
  pct_in = pct_in,
  pct_out = pct_out,
  specificity = pct_in - pct_out,
  pval = pval
)

#Only keep significant genes (p < 0.05)
de_sig <- de[de$pval < 0.05, ]

#Sort by specificity
de_sig <- de_sig[order(de_sig$specificity, decreasing = TRUE), ]

head(de_sig, 10)

#Top 10 for plotting
top_de <- head(de_sig, 10)
top_de$gene <- factor(rownames(top_de), levels = rev(rownames(top_de)))
de_long <- data.frame(
  gene = rep(top_de$gene, 2),
  percent = c(top_de$pct_in, top_de$pct_out),
  group = rep(c("In cluster", "Outside cluster"), each = nrow(top_de))
)

de_pct_plot <- ggplot(de_long, aes(x = gene, y = percent, fill = group)) +
  geom_col(position = position_dodge(width = 0.8)) +
  coord_flip() +
  theme_classic(base_size = 12) +
  scale_fill_manual(values = c("In cluster" = "blue", "Outside cluster" = "red")) +
  ylab("Fraction of cells expressing gene") +
  xlab("") +
  ggtitle("Top differentially expressed genes ranked by specificity")

#tsne
library(Rtsne)
set.seed(1)

tsne <- Rtsne(selected_PCs, dims = 2, perplexity = 30)

tsne_df <- data.frame(
  tSNE1 = tsne$Y[, 1],
  tSNE2 = tsne$Y[, 2],
  cluster = clusters  
)

#tSNE plot with cluster 9 highlighted
tsne_df$highlight <- ifelse(tsne_df$cluster == "9", "Cluster of interest", "Other clusters")
tsne_cluster <- ggplot(tsne_df, aes(tSNE1, tSNE2, color = highlight)) +
  geom_point(size = 1, alpha=0.4) +
  scale_color_manual(name = "Cluster", values = c("Other clusters" = "grey80", "Cluster of interest" = "blue"),
                     guide = guide_legend(override.aes = list(size = 1, alpha = 0.8))) +
  theme_classic() +
  coord_fixed()+
  ggtitle("Cluster of interest highlighted in tSNE space")

#plot gene expression in PCA and spatial space
pc_df$gene_expr <- mat[rownames(pc_df), "Slc22a19"]
space_df$gene_expr <- mat[rownames(space_df), "Slc22a19"]

pca_gene <- ggplot(pc_df, aes(x = PC1, y = PC2, color = gene_expr)) +
  geom_point(size = 1, alpha = 0.4) +
  theme_classic() +
  coord_fixed() +
  scale_color_gradient(low = "grey80", high = "blue")+
  ggtitle("Slc22a19 expression in PCA space") +
  labs(color = "Slc22a19")

space_gene <- ggplot(space_df, aes(x = x, y = y, color = gene_expr)) +
  geom_point(size = 1, alpha = 0.4) +
  theme_classic() +
  coord_fixed() +
  scale_color_gradient(low = "grey80", high = "blue")+
  ggtitle("Slc22a19 expression in physical space") +
  labs(color = "Slc22a19")

#NOTE: view the plot in full screen mode or else it looks squished
(pca_cluster | tsne_cluster | space_cluster) / (de_pct_plot | pca_gene | space_gene)

Attributions: Code adapted from HW3, all previous attributions still apply

17 Feb 2026

« Identifying Proximal Tubule Cell Populations in Spot-Resolution Spatial Transcriptomics Identification of kidney collecting duct principal cells through dimensionality reduction, k-means clustering, and differential expression analysis »