Xenium Kidney: Cluster 5 and Marker Gene Slc5a2

I’m pursuing the combined BS/MSE in Biomedical Engineering, and in my free time I love playing basketball, trying different types of tea, and unwinding with podcasts.

Xenium Kidney: Cluster 5 and Marker Gene Slc5a2

##Description

The data visualization I created depicts the identification and characterization of a transcriptionally distinct cell cluster (Cluster 5 in my code) in Xenium spatial transcriptomics data from mouse kidney tissue. The goal of the analysis was to determine whether one of the unsupervised clusters corresponds to a known kidney cell type and to validate that interpretation using both differential gene expression and spatial localization.

The workflow follows the dimensionality reduction and clustering pipeline introduced in lectures 3, 4 and 5 [1]. Raw gene counts were first normalized and log-transformed to make expression comparable across cells. PCA was then used to reduce the dimensionality of the dataset while preserving the dominant sources of transcriptional variation. An elbow analysis of k-means clustering on the top principal components suggested that seven clusters capture the major transcriptional structure of the tissue. These clusters were embedded into 2D using tSNE to visualize relationships between cells in latent space.

Panels 1 and 2 show Cluster 5 highlighted relative to all other cells. In tSNE space (Panel 1), Cluster 5 forms a compact and well-separated region, indicating a distinct transcriptional identity. In physical space (Panel 2), these cells occupy a coherent anatomical region of the kidney tissue. In my opinion, the agreement between transcriptional separation and spatial organization supports the interpretation that Cluster 5 corresponds to a real biological cell population rather than an artifact of clustering.

To characterize the molecular identity of Cluster 5, I performed gene-wise differential expression testing comparing cells inside Cluster 5 to all other cells using a one-sided Wilcoxon rank-sum test, as discussed in lecture 6 [1]. P-values were adjusted using the Benjamini-Hochberg procedure to control the false discovery rate [2]. The resulting ranked list of genes is summarized in Panel 3, which shows the top markers upregulated in Cluster 5.

In Panel 3, bar length represents log2 fold change, which measures how much more strongly a gene is expressed in Cluster 5 relative to all other clusters [3]. Importantly, fold change does not measure absolute expression level [3]. A gene can have a large fold change if it is relatively rare everywhere but still consistently more expressed in Cluster 5. To address this, the color of each bar encodes the mean expression level within Cluster 5. This combination is informative because it distinguishes genes that are both strongly enriched and highly expressed from genes that are statistically enriched but low in absolute abundance. In other words, a long bar (high fold change) does not always imply high expression; the color scale clarifies whether the gene is actually abundant in the cluster.

Among the top differentially expressed genes, through extensive literature review of a few handpicking differentially expressed genes, Slc5a2 emerged as a particularly strong and biologically meaningful marker [4]. Slc5a2 encodes a sodium-glucose cotransporter that is canonically expressed in proximal tubule epithelial cells and is widely used as a marker of this segment of the nephron (specifically, the apical domain of renal proximal tubule cells) [5]. In my data, Slc5a2 is both strongly enriched in Cluster 5 and highly specific to that cluster relative to others. Panels 4 and 5 visualize Slc5a2 expression in tSNE and physical space, respectively. In tSNE space (Panel 4), Slc5a2 expression is concentrated almost exclusively within Cluster 5, reinforcing its specificity. In physical space (Panel 5), the spatial distribution of Slc5a2-positive cells aligns with anatomical regions consistent with proximal tubule localization in the kidney cortex [6].

Panel 6 provides an additional quantitative validation by plotting the mean Slc5a2 expression across all clusters. Cluster 5 shows a pronounced elevation compared to every other cluster, confirming that Slc5a2 is not merely present but is maximally expressed in this population. This cross-cluster comparison strengthens the argument that Cluster 5 represents a proximal tubule cell population rather than a mixed or ambiguous group.

Taken together, the transcriptional distinctness of Cluster 5, the enrichment of canonical proximal tubule marker genes, and the coherent spatial localization of Slc5a2 expression strongly support the interpretation that Cluster 5 corresponds to proximal tubule epithelial cells in the mouse kidney. This interpretation is consistent with prior literature cited above describing Slc5a2 and related transport genes as defining features of proximal tubule identity and function

Also, this interpretation should be understood as a hypothesis rather than a definitive cell type assignment. Spatial transcriptomics and unsupervised k-means clustering do not produce perfectly pure cell populations. Each segmented cell can reflect local microenvironmental signals, and k-means groups cells by dominant transcriptional similarity rather than separating continuous biological gradients. As a result, Cluster 5 likely represents a population whose predominant identity is proximal tubule (PT) epithelial cells, but some heterogeneity is expected. Additional validation using independent single-cell or spatial datasets would further strengthen this assignment.

References

[1] https://jef.works/genomic-data-visualization-2026/course

[2] https://www.r-bloggers.com/2023/07/the-benjamini-hochberg-procedure-fdr-and-p-value-adjusted-explained/

[3] https://www.cores.emory.edu/eicc/_includes/documents/sections/resources/RNAseq_Methodology.html

[4] https://pmc.ncbi.nlm.nih.gov/articles/PMC8722798/

[5] https://maayanlab.cloud/Harmonizome/gene/SLC5A2

[6] https://www.nature.com/articles/s41467-025-62599-9

##Code

# Spatial Transcriptomics Analysis: Xenium Kidney Dataset
# Analyzing spatial gene expression data to identify cell types in kidney tissue

# Setup: Clear workspace and set seed for reproducibility
rm(list = ls())
set.seed(42)

# Load libraries
library(ggplot2)
library(Rtsne)
library(dplyr)
library(patchwork)

# STEP 1: Load and preprocess data

# Load Xenium spatial transcriptomics data
data <- read.csv("~/Desktop/GDV/Xenium-IRI-ShamR_matrix.csv", check.names = FALSE)

# Extract spatial coordinates for each cell
positions <- data[, c("x","y")]
rownames(positions) <- data[,1]

# Extract gene expression matrix (starts at column 4)
gene_exp <- data[, 4:ncol(data)]
rownames(gene_exp) <- data[,1]
gene_exp <- as.matrix(gene_exp)
mode(gene_exp) <- "numeric"

# Normalize gene expression by total counts per cell
total_gene_exp <- rowSums(gene_exp)
total_gene_exp[total_gene_exp == 0] <- 1

# Log-normalize: converts counts to log10(CPM + 1)
mat <- log10(gene_exp/total_gene_exp * 1e6 + 1)

# STEP 2: Principal Component Analysis (PCA)
# Reduces dimensionality from thousands of genes to top principal components

# Run PCA on normalized expression matrix
PCs <- prcomp(mat, center = TRUE, scale = FALSE)

# Create dataframe for scree plot (shows variance explained by each PC)
scree_df <- data.frame(
  PC = 1:25,
  SD = PCs$sdev[1:25]
)

# Plot scree plot to visualize variance captured
ggplot(scree_df, aes(x = PC, y = SD)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  labs(
    title = "Scree plot",
    x = "Principal Component",
    y = "Standard deviation"
  )

# Keep first 10 PCs for clustering (captures most variance)
n_pcs <- 10
top_PCs <- PCs$x[, 1:n_pcs]

# STEP 3: Determine optimal number of clusters using elbow method

# Help from Claude. Prompt: Write code to calculate total within-cluster sum of squares (TWSS) for K values from 1 to 12 using k-means clustering on the top 10 principal components. For each K, run k-means with nstart=10 and iter.max=50, and store the tot.withinss values in a vector.
twss <- sapply(1:12, function(k) {
  kmeans(top_PCs, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Create dataframe for elbow plot
twss_df <- data.frame(
  K = 1:12,
  TWSS = twss
)

# Plot elbow plot to choose optimal K
ggplot(twss_df, aes(x = K, y = TWSS)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  labs(
    title = "K-means elbow plot",
    x = "Number of clusters (K)",
    y = "TWSS"
  )

# STEP 4: K-means clustering
# Based on elbow plot, chose K=7 clusters

K <- 7

# Run k-means clustering
km <- kmeans(top_PCs, centers = K, nstart = 50, iter.max = 50)
clusters <- as.factor(km$cluster)

# Check number of cells in each cluster
table(clusters)

# STEP 5: t-SNE for visualization
# Further reduces dimensionality to 2D for visualization

tsne <- Rtsne(top_PCs, dims = 2, perplexity = 30, check_duplicates = FALSE)

# Extract t-SNE coordinates
emb <- tsne$Y
colnames(emb) <- c("tsne1", "tsne2")

# Combine t-SNE, cluster assignments, and spatial positions
cell_data <- data.frame(
  emb,
  clusters,
  positions
)

# Visualize clusters in t-SNE space
p_tsne <- ggplot(cell_data, aes(tsne1, tsne2, col = clusters)) +
  geom_point(size = 0.1, alpha = 0.6) +
  theme_minimal(base_size = 10) +
  theme(panel.grid = element_blank()) +
  guides(colour = guide_legend(
    override.aes = list(size = 4, alpha = 1)
  )) +
  labs(title = "Clusters in tSNE", col = "Cluster")

# Visualize clusters in physical tissue space
p_space <- ggplot(cell_data, aes(x, y, col = clusters)) +
  geom_point(size = 0.1, alpha = 0.6) +
  coord_fixed() +
  theme_minimal(base_size = 10) +
  theme(panel.grid = element_blank()) +
  guides(colour = guide_legend(
    override.aes = list(size = 4, alpha = 1)
  )) +
  labs(title = "Clusters in physical space", col = "Cluster")

p_tsne + p_space

# STEP 6: Differential expression analysis
# Find genes that are upregulated in cluster 5 compared to all other clusters

# Define cluster of interest (cluster 5)
cluster_of_interest <- 5
cluster_of_interest <- as.character(cluster_of_interest)

# Identify cells inside vs outside cluster 5
in_cells <- which(clusters == cluster_of_interest)
out_cells <- which(clusters != cluster_of_interest)

# Help from Claude. Prompt: For each gene in the matrix, perform a one-sided Wilcoxon rank-sum test comparing expression in cluster 5 cells versus all other cells, testing if cluster 5 has greater expression. Store all p-values in a vector.
pvals <- sapply(colnames(mat), function(gene) {
  wilcox.test(
    mat[in_cells, gene],
    mat[out_cells, gene],
    alternative = "greater"
  )$p.value
})

# Adjust p-values for multiple testing using Benjamini-Hochberg correction
padj <- p.adjust(pvals, method = "BH")

# Calculate mean expression inside and outside cluster 5
mean_in <- colMeans(mat[in_cells, , drop = FALSE])
mean_out <- colMeans(mat[out_cells, , drop = FALSE])

# Calculate log2 fold change
log2fc <- log2((mean_in + 0.01) / (mean_out + 0.01))

# Combine results into dataframe
de <- data.frame(
  gene = colnames(mat),
  log2fc = log2fc,
  padj = padj,
  mean_in = mean_in
)

# Sort by strongest upregulation
de_sorted <- de %>%
  arrange(desc(log2fc))

# Print top 15 differentially expressed genes
cat("\nTop 15 DE genes:\n")

print(
  de_sorted %>%
    select(gene, log2fc, padj, mean_in) %>%
    head(15)
)

# Select marker gene for visualization (Slc5a2 is a proximal tubule marker)
marker_gene <- "Slc5a2"

cell_data$gene_expr <- mat[, marker_gene]

# Label cells as cluster 5 or other
cell_data$celltype <- ifelse(
  cell_data$clusters == cluster_of_interest,
  "Cluster 5",
  "Other"
)

table(cell_data$celltype)

# STEP 7: Create individual plots for final figure

# Panel 1: Cluster 5 highlighted in tSNE space
p1 <- ggplot(cell_data, aes(tsne1, tsne2, color = celltype)) +
  geom_point(size = 0.15, alpha = 0.7) +
  scale_color_manual(values = c(
    "Cluster 5" = "#D55E00",
    "Other" = "grey85"
  )) +
  theme_classic(base_size = 11) +
  guides(color = guide_legend(
    override.aes = list(size = 3, alpha = 1)
  )) +
  labs(
    title = "Cluster 5 in tSNE",
    x = "tSNE 1",
    y = "tSNE 2",
    color = "Cell type"
  )

p1

# Panel 2: Cluster 5 in physical tissue space
p2 <- ggplot(cell_data, aes(x, y, color = celltype)) +
  geom_point(size = 0.15, alpha = 0.7) +
  scale_color_manual(values = c(
    "Cluster 5" = "#D55E00",
    "Other" = "grey85"
  )) +
  coord_fixed() +
  theme_classic(base_size = 11) +
  guides(color = guide_legend(
    override.aes = list(size = 3, alpha = 1)
  )) + 
  labs(
    title = "Cluster 5 in physical space",
    x = "x position",
    y = "y position",
    color = "Cell type"
  )
p2 

# Panel 3: Bar plot of top differentially expressed genes

library(dplyr)
library(ggplot2)

# Number of top markers to display
topN <- 15

# Ensure mean_out column exists
if (!("mean_out" %in% colnames(de))) {
  mean_in_tmp  <- colMeans(mat[in_cells, , drop = FALSE])
  mean_out_tmp <- colMeans(mat[out_cells, , drop = FALSE])
  de$mean_in <- as.numeric(mean_in_tmp)
  de$mean_out <- as.numeric(mean_out_tmp)
}

# Help from Claude. Prompt: Take the top 15 genes by log2 fold change from the differential expression results. Cap extremely small p-values at 1e-300, calculate -log10 p-values, and reorder genes as a factor for plotting from lowest to highest fold change (so they appear top to bottom in the bar plot).
de_markers <- de %>%
  mutate(
    padj_cap = pmax(padj, 1e-300),
    neglog10_p = -log10(padj_cap)
  ) %>%
  arrange(desc(log2fc), desc(mean_in)) %>%
  slice_head(n = topN) %>%
  mutate(gene = factor(gene, levels = rev(gene)))

# Create bar plot with mean expression as fill color
p3 <- ggplot(de_markers, aes(x = gene, y = log2fc)) +
  geom_col(aes(fill = mean_in)) +
  coord_flip() +
  scale_fill_viridis_c(option = "viridis") +
  theme_classic(base_size = 11) +
  labs(
    title = "Top marker genes upregulated in Cluster 5",
    subtitle = "Differential expression: Cluster 5 vs all other clusters (one-sided Wilcoxon, BH-adjusted p)",
    x = NULL,
    y = "log2 fold change (Cluster 5 / Others)",
    fill = "Mean expr\nin Cluster 5"
  )

p3

# Panels 4 and 5: Marker gene expression visualization

marker_gene <- "Slc5a2"
cell_data$marker_expr <- mat[, marker_gene]

# Panel 4: Marker gene expression in tSNE space
p4 <- ggplot(cell_data, aes(tsne1, tsne2, color = marker_expr)) +
  geom_point(size = 0.15, alpha = 0.75) +
  scale_color_viridis_c(option = "viridis") +
  theme_classic(base_size = 11) +
  labs(
    title = paste0(marker_gene, " expression in tSNE space"),
    x = "tSNE 1",
    y = "tSNE 2",
    color = "log10 expr"
  )
p4

# Panel 5: Marker gene expression in physical space
p5 <- ggplot(cell_data, aes(x, y, color = marker_expr)) +
  geom_point(size = 0.15, alpha = 0.75) +
  coord_fixed() +
  scale_color_viridis_c(option = "viridis") +
  theme_classic(base_size = 11) +
  theme(
    axis.line = element_line(),
    axis.ticks = element_line()
  )+
  labs(
    title = paste0(marker_gene, " expression in physical space"),
    x = "x position",
    y = "y position",
    color = "log10 expr"
  )
p5

# Panel 6: Mean marker gene expression per cluster (bar plot)

marker_gene <- "Slc5a2"
cell_data$marker_expr <- mat[, marker_gene]

# Calculate mean expression per cluster
mean_by_cluster <- cell_data %>%
  group_by(clusters) %>%
  summarise(mean_expr = mean(marker_expr, na.rm = TRUE), .groups = "drop") %>%
  mutate(is_cluster5 = (clusters == cluster_of_interest))

# Create bar plot highlighting cluster 5
p6 <- ggplot(mean_by_cluster, aes(x = clusters, y = mean_expr, fill = is_cluster5)) +
  geom_col() +
  scale_fill_manual(values = c("TRUE" = "#D55E00", "FALSE" = "grey80")) +
  theme_classic(base_size = 11) +
  labs(
    title = paste0("Mean ", marker_gene, " expression by cluster"),
    x = "Cluster",
    y = "Mean log10 normalized expression",
    fill = NULL
  ) +
  theme(legend.position = "none")

p6

# STEP 8: Assemble final multi-panel figure using patchwork
# Help from ChatGPT. Prompt: Create a function called fix_panel that takes a ggplot object and optional title, subtitle, and legend position arguments. The function should update the plot's title and subtitle, apply theme_classic with base_size 12, and customize the plot title to be bold size 11, subtitle size 10, set plot.title.position to "plot", add margins of 8-10 pixels around the plot, and set legend position, title size 10, and text size 9.
# =========================
# SIMPLE FINAL MULTI-PANEL
# (matches your preferred layout)
# Row 1: p1 p2 p4 p5
# Row 2: p3 (wide)
# Row 3: p6 (wide)
# =========================

library(patchwork)
library(ggplot2)

# 1) Make ONLY the key fixes:
# - wrap long titles so they don't collide
# - keep legends next to each plot (no global "collect")
# - shrink legend text a bit
# - add small margins around each panel

fix_panel <- function(p, title = NULL, subtitle = NULL, legend_pos = "right") {
  p +
    labs(title = title, subtitle = subtitle) +
    theme_classic(base_size = 12) +
    theme(
      plot.title = element_text(face = "bold", size = 11),
      plot.subtitle = element_text(size = 10),
      plot.title.position = "plot",
      plot.margin = margin(8, 10, 8, 10),
      legend.position = legend_pos,
      legend.title = element_text(size = 10),
      legend.text  = element_text(size = 9)
    )
}

# p1: cluster highlight in tSNE (remove legend to match p6)
p1_fix <- fix_panel(
  p1,
  title = "Cluster 5 in tSNE", 
  legend_pos = "none"
) + 
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# p2: cluster highlight in space (keep legend on right)
p2_fix <- fix_panel(
  p2, title = "Cluster 5 in physical space", legend_pos = "right"
) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# p4: gene in tSNE (remove legend to match p6)
p4_fix <- fix_panel(
  p4,
  title = "Slc5a2 expression in tSNE space",
  legend_pos = "none"
) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# p5: gene in space (keep legend)
p5_fix <- fix_panel(
  p5, title = "Slc5a2 expression in physical space", legend_pos = "right"
) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# p3: DE bars (keep legend)
p3_fix <- fix_panel(
  p3,
  title = "Top DE genes (Cluster 5 vs others)",
  subtitle = "Cluster 5 vs all other clusters (one-sided Wilcoxon, BH-adjusted p)",
  legend_pos = "right"
) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

# p6: mean bar plot (remove legend)
p6_fix <- fix_panel(
  p6,
  title = "Mean Slc5a2 expression by cluster",
  legend_pos = "none"
) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# 2) Assemble in 3 rows with explicit equal widths:
row1 <- (p1_fix | p2_fix) + plot_layout(widths = c(1, 1))
row2 <- (p3_fix | p6_fix) + plot_layout(widths = c(1, 1))
row3 <- (p4_fix | p5_fix) + plot_layout(widths = c(1, 1))

final_plot <- row1 / row2 / row3 +
  plot_layout(heights = c(1, 1.2, 1)) +
  plot_annotation(
    title = "Xenium kidney: Cluster 5 and marker gene Slc5a2",
    theme = theme(
      plot.title = element_text(face = "bold", size = 20, hjust = 0.5)
    )
  )

final_plot

11 Feb 2026

« A multipanel data visualization distinguishing the ascending loop of henle in mouse kidney tissue Identifying a kidney cortical tubule region using marker genes »