Identifying Proximal Tubule Cell Populations in Spot-Resolution Spatial Transcriptomics

I am a freshman studying Computer Science and Neuroscience.

Identifying Proximal Tubule Cell Populations in Spot-Resolution Spatial Transcriptomics

For these visualizations, I switched over to the spot-resolution Visium dataset from the single-cell Xenium dataset. I was able to successfully apply the same pre-processing pipeline and maintain parameters from homework 3, including a k value of 5 for k-means clustering as determined by an elbow plot.

However, I had to alter my methodological approach to how I identified my cluster of interest. In homework 3, I was able to select an appealing cluster of interest (cluster 1 in that case) from its spatial distribution in the tissue and annotate its cell type according to the highly expressed proximal tubule cell marker gene Acmsd.

Since my goal in this homework was to identify the same cluster and cell type, I had to generalize my code. Therefore, I adapted my previous differential expression analysis code to loop through all 5 clusters in the Visium dataset and output the top 10 upregulated genes in each cluster. From there, I was able to cross-reference sources such as the Human Protein Atlas to better understand the cellular and functional identity of each of my clusters.

Ultimately, I was able to identify cluster 2 as the cluster of interest in the Visium sample, as it is highly expressed in Slc2a2, a proximal tubule cell marker according to the Human Protein Atlas. Moreover, as seen in the figures above, the spatial tissue location and tSNE representation of Slc2a2 strongly align with cluster 2, as well as the cortex of the kidney, the same area that cluster 1 of interest occupied in the previous Xenium analysis. Additionally, the high expression of other proximal tubule cell markers, such as Slc51a and Slc6a19, in this cluster further validates the annotation of this cluster.

Ultimately, although I utilized different marker genes (Acmsd vs. Slc2a2) between the Visium and Xenium datasets to identify the proximal tubule cells, both clusters are strongly enriched in the overall functionality of this cell type in solute carrying within the proximal tubules. This difference in gene expression within the clusters is likely a reflection of the overall gene detection capabilities of single-cell vs. spot resolution spatial transcriptomics.

Another change I made to this code was the parameters used for my volcano plot data visualization. In homework 3, I labeled the top 12 genes labeled by p-value. However, this approach created a cluttered and less effective volcano plot for my Visium dataset in terms of conveying genes with biological significance at hand, so I decided to only label my 3 genes of interest: Slc2a2, Slc51a, and Slc6a19 that support the proximal tubule labeling of cluster 2. In general, to label the differentially upregulated genes in the cluster, I also lowered the log fold change threshold from 0.6 to 0.3 since spot-level data seemed to produce smaller fold change values due to the mixture of cell types in each spot.

Resources: https://www.proteinatlas.org/ENSG00000163581-SLC2A2/single+cell https://www.proteinatlas.org/ENSG00000163959-SLC51A/single+cell https://www.proteinatlas.org/ENSG00000174358-SLC6A19/single+cell

Code

## HW 4 CODE: Adapting Xenium Code for Visium Analysis ## 

# Load in Data (Spot Resolution Visium Data) ---------------------------------------------
data <- read.csv('/Users/gracexu/genomic-data-visualization-2026/data/Visium-IRI-ShamR_matrix.csv')
pos <- data[,c('x', 'y')]
rownames(pos) <- data[,1]
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[,1]
gexp[1:5,1:5]
dim(gexp) # 1224 x 19465 

# Load in required packages ------------------------------------------------------------------------------------------
library(ggplot2)
library(patchwork)

# Normalize Data (Library-size normalization --> Counts per million --> Log transformation) ---------------------------------------------
totgexp <- rowSums(gexp)
head(totgexp)
head(sort(totgexp, decreasing = TRUE))
mat <- log10(gexp / totgexp * 1e6 + 1)
dim(mat)

# PCA (Linear dimension reduction) ------------------------------------------------------------------------------------------
pcs <- prcomp(mat, center = TRUE, scale = FALSE) 
names(pcs)
head(pcs$sdev)
plot(pcs$sdev[1:50])
toppcs <- pcs$x[, 1:10] # Keep top PCs as the top 10 PCs 

# Run tSNE ------------------------------------------------------------------------------------------------------------------------
set.seed(123) # for reproducibility

tsne <- Rtsne::Rtsne(toppcs, dims = 2, perplexity = 30)
emb <- tsne$Y
df <- data.frame(emb, totgexp)
names(df)
ggplot(df, aes(x=X1, y=X2, col=totgexp)) + geom_point(size = 1)

# k-Means Clustering ------------------------------------------------------------------------------------------------------------------------

## Find optimal k value by computing within sum of squares (wss)
k_vals <- 1:20
wss <- numeric(length(k_vals))

for (i in seq_along(k_vals)) {
  km <- kmeans(toppcs, # run on top 10 PCs
               centers = k_vals[i], 
               nstart = 20, #nstart reduces randomness 
               iter.max = 100)  
  wss[i] <- km$tot.withinss
}

# make a dataframe holding k values & corresponding wss scores 
df <- data.frame(
  k = k_vals,
  wss = wss
)

# make an elbow plot --> k = 5 seems like the best choice again here 
ggplot(df, aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Elbow Plot for Determining Optimal k",
    x = "Number of Clusters (k)",
    y = "Total Within Sum of Squares"
  ) +
  theme_minimal()

## Run k means with center of 5 
km <- kmeans(toppcs, centers = 5)
names(km)

cluster <- as.factor(km$cluster)

df <- data.frame(pos, emb, cluster, toppcs)
# tSNE 
ggplot(df, aes(x=X1, y=X2, col=cluster)) + geom_point(size = 0.5)
# PCA
ggplot(df, aes(x=PC1, y=PC2, col=cluster)) + geom_point(size = 0.5)
# Spatial Plot of Clusters 
ggplot(df, aes(x = x, y = y, col = cluster)) +
  geom_point(size = 0.5) +
  coord_fixed() +
  theme_minimal()

# Differential Expression Analysis: LOOPING OVER ALL CLUSTERS ------------------------------------------------------------------------------------------------------------------------
# Create an empty list to store the results 
results_list <- list()

for (clusterofinterest in 1:5) {
  in_cluster <- df$cluster == clusterofinterest

  # Calculate the log fold change (expression in cluster / expression not in cluster)
  mean_in  <- colMeans(mat[in_cluster, ])
  mean_out <- colMeans(mat[!in_cluster, ])
  logFC <- log2((mean_in + 1e-6) / (mean_out + 1e-6))
  
  # Run Wilcox Test for all genes in cluster of interest 
  pvals <- sapply(colnames(mat), function(gene) {
    wilcox.test(mat[in_cluster, gene], mat[!in_cluster, gene])$p.value
  })

  # Adjust p-values for multiple testing (Benjamini-Hochberg)
  pvals_adj <- p.adjust(pvals, method = "BH")

  # Store results for the cluster of interest 
  results_list[[clusterofinterest]] <- data.frame(
    gene = colnames(mat),
    logFC = logFC,
    pval_adj = pvals_adj,
    cluster = clusterofinterest
  )
}
  
# Print the top upregulated genes for each cluster 
for (i in 1:5) {
  cat("\n--- Cluster", i, "top markers ---\n")
  top <- results_list[[i]]
  top <- top[order(top$pval_adj), ]
  print(head(top[top$logFC > 0.6, c("gene", "logFC", "pval_adj")], 10))
}

# It appears that Cluster 2 matches Cluster 1 from HW3 based on the expression of Slc2a2 (a marker for proximal tubule cells)
clusterofinterest = 2

# Visualization Creation ------------------------------------------------------------------------------------------
## 1. Cluster of interest in reduced dimensional space ------------------------------------------------------------------------------------------
### tSNE 
p1 <- ggplot(df, aes(x = X1, y = X2, col = cluster == clusterofinterest)) +
  geom_point(size = 0.5) +
  scale_color_manual(values = c("grey", "red")) +
  labs(color = paste("Cluster 2")) +
  theme_minimal() +
  ggtitle("Cluster 2 in tSNE Space") +
  theme(plot.title = element_text(hjust = 0.5, size = 16), legend.position = "none")
p1

## 2. Cluster of interest in physical space ------------------------------------------------------------------------------------------
p2 <- ggplot(df, aes(x = x, y = y, col = cluster == clusterofinterest)) +
  geom_point(size = 0.5) +
  scale_color_manual(values = c("grey", "red")) +
  labs(color = paste("Cluster 2")) +
  theme_minimal() +
  ggtitle("Cluster 2 in Physical Space") +
  coord_fixed()  +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
p2

## 3. Differentially expressed genes for your cluster of interest ------------------------------------------------------------------------------------------
# create a volcano plot visualizing differentially expressed genes in cluster 2 

# only label the genes of interest
genes_of_interest <- c("Slc2a2", "Slc51a", "Slc6a19")
genes_to_label <- volcano_df[volcano_df$gene %in% genes_of_interest, ]

# create volcano plot
p3 <- ggplot(volcano_df, aes(x = logFC, y = neglog10p, col = express)) +
  geom_point(alpha = 0.6, size = 0.8) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = c(-0.3, 0.3), linetype = "dashed") +
  geom_label_repel(data = genes_to_label, aes(label = gene), 
                   size = 3, max.overlaps = 30) +
  scale_color_manual(values = c("Up" = "red", "Down" = "blue", "Not Significant" = "grey70")) +
  labs(title = "Differentially Expressed Genes in Cluster 2",
       x = "Log2 Fold Change", 
       y = "-Log10 Adjusted P-value",
       col = "Gene Expression Level") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
p3

## 4. DEG in reduced dimensional space
### tSNE
### gene of interest = Slc2a2 
df_expr <- df
df_expr$Slc2a2 <- mat[, 'Slc2a2']
p4 <- ggplot(df_expr, aes(x = X1, y = X2, col = Slc2a2)) +
  geom_point(size = 0.5) +
  scale_color_viridis_c() +
  theme_minimal() +
  ggtitle(paste("Expression of Slc2a2 in tSNE Space"))  +
  theme(plot.title = element_text(hjust = 0.5, size = 16), legend.position = "none")
p4

## 5. Visualizing one of these DEG in space
p5 <- ggplot(df_expr, aes(x = x, y = y, col = Slc2a2)) +
  geom_point(size = 0.5) +
  scale_color_viridis_c() +
  theme_minimal() +
  ggtitle(paste("Expression of Slc2a2 in Physical Space")) +
  coord_fixed()  +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
p5

# Combine all plots
(p1 | p2) / (p4 | p5) / p3

# Save plot
ggsave("/Users/gracexu/genomic-data-visualization-2026/homework_submission/HW4_Plots.png", width = 16, height = 18, dpi = 300)

# AI Help Prompts:  
## How can I adapt my current code that performs differential gene expression analysis to loop through all of the clusters in my dataset, printing the top 10 upregulated genes in each cluster?  

17 Feb 2026

« Identification of Thick Ascending Limb Cells in Xenium Dataset Identifying a cluster of Proximal Tubule Epithelial Cells »