HW4-Identification and Characterization of a Transcriptionally Distinct Proximal Tubule Epithelial Cluster in Visium Kidney Spatial Transcriptomics Data

Hi everyone, I’m a second-year PhD student in Biomedical Engineering, and my research focuses on medical imaging. I’m very happy to meet you all and looking forward to learning together.

HW4-Identification and Characterization of a Transcriptionally Distinct Proximal Tubule Epithelial Cluster in Visium Kidney Spatial Transcriptomics Data

1. Describe your figure briefly.

In these panels, I depict the Visium kidney spatial transcriptomics dataset in latent tSNE-embedded space and over the original spatial slide coordinates. Before clustering, I evaluated a range of k values (k = 2–20) by running k-means on the first 15 principal components and examining the total within-cluster sum of squares to identify an elbow point. This analysis suggested k = 10 for Xenium, which provides a reasonable balance between capturing transcriptional heterogeneity and avoiding over-fragmentation. In contrast, applying the same elbow-plot procedure to the Visium dataset suggested a smaller k (k = 5), consistent with the lower effective resolution and increased spot-level mixing in Visium.

In HW3 (Xenium), I originally aimed to identify an endothelial-like population using canonical endothelial markers (e.g., Kdr/VEGFR2, Pecam1, Cdh5, Vwf). However, upon inspection of the Visium gene set, Kdr was not present, and I did not detect other canonical endothelial marker genes in the available panel. Because this prevented a marker-based identification of endothelial cells in the Visium dataset, I instead began by identifying a different cell type in the Xenium. 

Using k = 10 for Xenium, I examined the resulting clusters in tSNE-embedded space and selected a cluster of interest that forms a compact and well-separated group relative to surrounding clusters (A), indicating a transcriptionally distinct population. I then mapped the same cluster back onto the original spatial coordinates (B) to assess whether its spatial organization is biologically coherent rather than an artifact of the embedding.

Next, I performed differential expression analysis for the cluster of interest versus all other spots using a Wilcoxon rank-sum test and computed log2 fold change for each gene. Differential gene expression is visualized with a volcano plot (C), where genes are positioned by effect size (log2 fold change) and statistical significance (−log10 p-value). I selected Slc22a6, a gene supported by the Xenium panel, as a representative marker for the cluster of interest. Slc22a6 encodes the organic anion transporter OAT1 and is a well-established marker of proximal tubule epithelial cells[1].

I then visualized the normalized expression of Slc22a6 across the tSNE embedding (D) and across the original tissue coordinates (E). Enrichment of Slc22a6 within the selected cluster in latent space, together with a spatial distribution consistent with renal cortical proximal tubules, provides evidence that the cluster corresponds to a coherent population of proximal tubule epithelial cells.

Finally, having identified a robust, panel-supported proximal tubule marker and corresponding cell population in Xenium, I leveraged this information to guide interpretation in the Visium dataset. Differential expression analysis in Visium revealed Slc22a6 to be significantly upregulated, as evidenced by its prominent position in the Visium volcano plot. Visualization of Slc22a6 expression across the Visium latent embedding and physical tissue coordinates demonstrated a spatial pattern highly concordant with that observed in Xenium, with enrichment localized to cortical regions characteristic of proximal tubules. The close agreement in both transcriptional signature and physical spatial distribution across platforms supports the conclusion that the Visium cluster represents the same proximal tubule epithelial cell population identified in Xenium. This cross-platform strategy ensures that cell-type inference is driven by genes actually captured on each platform while maintaining a consistent and biologically coherent interpretation across datasets.

[1]Vávra, Jiří, et al. “Functional characterization of rare variants in OAT1/SLC22A6 and OAT3/SLC22A8 urate transporters identified in a gout and hyperuricemia cohort.” Cells 11.7 (2022): 1063.

2. Code (paste your code in between the ``` symbols)

## ---------------------------
## 0) Load required packages
## ---------------------------
pkgs <- c("data.table", "ggplot2", "patchwork", "Rtsne")

to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}

library(data.table)
library(ggplot2)
library(patchwork)
library(Rtsne)


set.seed(123)

## ---------------------------
## 1) Load data
## ---------------------------
file <- "/Users/xl/Desktop/JHU2026Spring/Genomic-Data-Visualization/Homework/HW4/Visium-IRI-ShamR_matrix.csv.gz"
dt <- fread(file)

cat("Data dimensions:", dim(dt), "\n")

## 2) Explicitly set columns
barcode_col <- "V1"
x_col <- "x"
y_col <- "y"

stopifnot(all(c(barcode_col, x_col, y_col) %in% colnames(dt)))

barcodes <- dt[[barcode_col]]

pos <- data.frame(
  aligned_x = dt[[x_col]],
  aligned_y = dt[[y_col]]
)
rownames(pos) <- barcodes

## 3) Expression matrix: all remaining columns are genes
gene_cols <- setdiff(colnames(dt), c(barcode_col, x_col, y_col))

# Convert to numeric matrix safely
gexp_dt <- dt[, ..gene_cols]

for (cc in colnames(gexp_dt)) {
  if (!is.numeric(gexp_dt[[cc]])) {
    gexp_dt[[cc]] <- as.numeric(gexp_dt[[cc]])
  }
}

gexp <- as.matrix(gexp_dt)
rownames(gexp) <- barcodes

# Remove all-zero genes
gexp <- gexp[, colSums(gexp, na.rm = TRUE) > 0, drop = FALSE]

cat("Expression matrix dim:", dim(gexp), "\n")

## 4) Feature selection + normalization
topN <- 1500
gene_order <- order(colSums(gexp), decreasing = TRUE)
head(gene_order)
topgenes <- colnames(gexp)[gene_order[1:min(topN, ncol(gexp))]]
gsub <- gexp[, topgenes, drop = FALSE]

libsize <- rowSums(gsub)
libsize[libsize == 0] <- 1

norm <- (gsub / libsize) * 1e4
logexp <- log1p(norm)

## 5) PCA + kmeans
pcs <- prcomp(logexp, center = TRUE, scale. = TRUE)

ks <- 2:20

totw <- sapply(ks, function(k) {
  km_tmp <- kmeans(pcs$x[,1:15], centers = k, nstart = 20)
  km_tmp$tot.withinss
})

elbow_df <- data.frame(k = ks, tot_withinss = totw)

ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_point(size = 2) +
  geom_line() +
  labs(title = "Elbow plot for choosing k",
       x = "Number of clusters (k)",
       y = "Total within-cluster sum of squares") +
  theme_classic()

k_final <- 5
km <- kmeans(pcs$x[, 1:15], centers = k_final, nstart = 25)
clusters <- km$cluster

pca_df <- data.frame(
  PC1 = pcs$x[,1],
  PC2 = pcs$x[,2],
  cluster = factor(clusters)
)

ggplot(pca_df, aes(PC1, PC2, color = cluster)) +
  geom_point(size = 0.6) +
  labs(title = "PCA of kidney Xenium data colored by k-means clusters",
       color = "Cluster") +
  theme_classic()


sp <- data.frame(pos)
sp$cluster <- factor(clusters)
ggplot(sp, aes(aligned_x, aligned_y)) +
  geom_point(size = 0.8, color = "steelblue") +
  facet_wrap(~ cluster, ncol = 3) +
  labs(title = "Spatial distribution of each cluster",
       x = "x", y = "y") +
  theme_classic()


## 6) Pick a cluster of interest
pca_xy <- pcs$x[, 1:2]
interest <- 3
cat("Cluster of interest:", interest, "\n")

cells_interest <- barcodes[clusters == interest]
cells_other <- barcodes[clusters != interest]

## 7) Differential expression (Wilcoxon)
wilcox_p <- sapply(colnames(logexp), function(g) {
  x <- logexp[cells_interest, g]
  y <- logexp[cells_other, g]
  suppressWarnings(wilcox.test(x, y)$p.value)
})


log2fc <- sapply(colnames(norm), function(g) {
  log2((mean(norm[cells_interest, g]) + 1e-3) /
         (mean(norm[cells_other, g]) + 1e-3))
})

de_df <- data.frame(
  gene = names(wilcox_p),
  pval = as.numeric(wilcox_p),
  neglog10p = -log10(as.numeric(wilcox_p) + 1e-300),
  log2fc = as.numeric(log2fc[names(wilcox_p)])
)

de_df <- de_df[order(de_df$pval, -de_df$log2fc), ]
ggplot(de_df, aes(x = log2fc, y = neglog10p)) +
  geom_point(size = 0.6) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(1e-5), linetype = "dashed") +
  labs(title = "Volcano plot: cluster of interest vs others",
       x = "log2 fold-change",
       y = "-log10(p-value)") +
  theme_classic()

sig_up <- de_df[de_df$pval < 1e-5 & de_df$log2fc > 1, ]
head(sig_up, 20)



marker_gene <- "Slc22a6"


## 9) tSNE embedding
tsne_res <- Rtsne(pcs$x[, 1:15], perplexity = 30, check_duplicates = FALSE)
emb <- data.frame(tsne_res$Y)
colnames(emb) <- c("tSNE1","tSNE2")
colnames(emb)
head(emb)
emb$is_interest <- ifelse(clusters == interest, "Cluster of interest", "Other")

## 10) Panels
big_theme <- theme(
  plot.title  = element_text(size = 14, face = "bold"),
  legend.text = element_text(size = 15),
  legend.title = element_text(size = 16),
  axis.title  = element_text(size = 13),
  axis.text   = element_text(size = 8)
)

p1 <- ggplot(emb, aes(tSNE1, tSNE2, color = is_interest)) +
  geom_point(size = 1.5) +
  labs(title = "A. Cluster of interest in tSNE space", color = NULL) +
  theme_classic() +
  theme(
    legend.text = element_text(size = 13)
  ) +
  big_theme
sp <- data.frame(pos)
sp$is_interest <- ifelse(clusters == interest, "Cluster of interest", "Other")

p2 <- ggplot(sp, aes(aligned_x, aligned_y, color = is_interest)) +
  geom_point(size = 1.5) +
  labs(title = "B. Cluster of interest in physical space",
       x = "x", y = "y", color = NULL) +
  theme_classic() +
  theme(
    legend.text = element_text(size = 13)
  ) +
  big_theme

de_df$group <- "Other"
de_df$group[de_df$pval < 1e-5 & de_df$log2fc > 1]  <- "Upregulated"
de_df$group[de_df$pval < 1e-5 & de_df$log2fc < -1] <- "Downregulated"


de_df$label <- ""
de_df$label[de_df$gene == marker_gene] <- marker_gene
p3 <- ggplot(de_df, aes(log2fc, neglog10p, color = group)) +
  
  # normal points
  geom_point(size = 0.8) +
  
  # red circle highlight (no fill, just outline)
  geom_point(
    data = subset(de_df, gene == marker_gene),
    shape = 21,          # circle with border
    size = 4,
    stroke = 1.2,
    color = "red",
    fill = NA
  ) +
  
  # annotation text
  geom_text(
    data = subset(de_df, gene == marker_gene),
    aes(label = gene),
    color = "red",
    vjust = -1.2,
    size = 4,
    show.legend = FALSE
  ) +
  
  scale_color_manual(
    values = c(
      "Upregulated"   = "#D55E00",
      "Downregulated" = "#0072B2",
      "Other"         = "grey70"
    )
  ) +
  
  labs(
    title = "C. Differentially expressed genes",
    x = "log2 fold-change",
    y = "-log10(p-value)",
    color = "Gene group"
  ) +
  
  theme_classic() +
  big_theme +
  coord_cartesian(clip = "off") +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.2)))


emb$gene_expr <- logexp[, marker_gene]
p4 <- ggplot(emb, aes(tSNE1, tSNE2, color = gene_expr)) +
  geom_point(size = 1.5) +
  labs(title = paste("D.", marker_gene, "expression in tSNE"),
       color = "log1p(norm)") +
  theme_classic() +
  big_theme

sp$gene_expr <- logexp[, marker_gene]
p5 <- ggplot(sp, aes(aligned_x, aligned_y, color = gene_expr)) +
  geom_point(size = 1.5) +
  labs(title = paste("E.", marker_gene, "expression in physical space"),
       x = "x", y = "y", color = "log1p(norm)") +
  theme_classic() +
  big_theme


multi_panel <- wrap_plots(
  p1, p2, p3,
  p4, p5, plot_spacer(),
  ncol = 3, byrow = TRUE
)

ggsave(
  filename = "multi_panel_cluster_Slc22a6_Visium.png",
  plot = multi_panel,
  width = 12,
  height = 8,
  dpi = 300
)

15 Feb 2026

« Characterizing Cell Type in Cluster 2 of Visium Dataset Using Dimensionality Reduction and Differential Expression Analysis HW 4 »