A multipanel data visualization distinguishing the ascending loop of henle in mouse kidney tissue

Hi! I am Saadia Jameel, a first year BME Ph.D. student

A multipanel data visualization distinguishing the ascending loop of henle in mouse kidney tissue

Describe your figure briefly so we know what you are depicting (you no longer need to use precise data visualization terms as you have been doing). Write a description to convince me that your cluster interpretation is correct. Your description may reference papers and content that allowed you to interpret your cell cluster as a particular cell-type. You must provide attribution to external resources referenced. Links are fine; formatted references are not required. You must include the entire code you used to generate the figure so that it can be reproduced.

Identified cell type- Ascending Loop of Henle (Xenium)

I plotted the total within-cluster sum of squares against increasing values of k to identify the elbow and thus determine the optimal k value to use in k-means clustering. The optimal value was k = 7, which is the same value I selected in HW3 while analyzing the Visium dataset.

One change I had to make in order to identify the same cell type as in HW3 was to use only two colors to visualize the seven clusters. I repeatedly changed the color of one cluster to red while keeping the remaining clusters grey until I identified the same set of cells in the same spatial location as in HW3. The reason I could not use seven distinct colors, as I did in HW3, is that there was no clear spatial separation between clusters, making it difficult to visually identify the same cell type. This is because, in the single cell resolution Xenium dataset, each spot represents a single cell, leading to substantial variability between neighboring cells. In contrast, in the spot based Visium dataset, each spot represents the averaged gene expression of multiple cells, which produces smoother global clustering patterns (spatial correlation) and clearer separation.

After identifying the cluster, I wanted to confirm that the cell type I identified in this dataset was the same as in HW3. To do this, I performed differential expression analysis to determine whether I could detect the same marker genes used previously. In HW3, I identified the marker genes Slc12a1 and Umod for the ascending loop of Henle among the top 20 differentially expressed genes in my cluster of interest versus the rest. In the Xenium dataset, however, these genes did not appear in the top 20. I therefore modified the code to examine the top 30 differentially upregulated genes in the cluster, where I was able to recover the same marker genes.

Once I identified the same marker genes, I visualized their expression in both reduced dimensional space and physical tissue space to verify that they were expressed in nearly all cells of the newly identified cluster, rather than only in a subset. After confirming consistent expression of these genes across the cluster, I was confident that I had identified the same cell type.

In panels B and D of my figure, the identified clusters are visualized in both t-SNE space and physical tissue space. I visualized the expression of these differentially expressed genes using violin plots in panel C. The violins corresponding to cluster 1 are shifted toward higher expression values compared to other clusters, with greater density at those elevated levels.

5. Code (paste your code in between the ``` symbols)

library(patchwork)
library(ggplot2)

# load data
data <- read.csv("~/Documents/genomic-data-visualization-2026/data/Xenium-IRI-ShamR_matrix.csv.gz")

# get the spatial positions
pos <- data[,c('x', 'y')]
rownames(pos) <- data[,1]

# get gene expression values
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[,1]

# PLAN
# 
# 1. Normalize and log transform.
# 2. dimensionality reduction PCA
# 3. tSNE
# 4. kmeans clustering- on first 5 PCs
# 5. use an elbow plot on total withinness to find an optimal k
# 6. statistical testing to identify differentially expressed genes
# 7. check what cell type the cluster could belong to.


# 1. perform normalization/ log-transform
totgexp= rowSums(gexp)
mat <- log10(gexp/totgexp *1e6 +1)

# 2. linear dimensionality reduction
pcs <- prcomp(mat)
# visualize PC
df <- data.frame(
  pos,
  pcs$x
)

# 3. tSNE
ts <- Rtsne::Rtsne(pcs$x[,1:10], dim=2)
names(ts)
emb <- ts$Y
df <- data.frame(
  emb,
  totgexp= totgexp
)

# 4. clustering

# plot total withinness elbow plot to see an appropriate value for k
wcss <- numeric()
k_vals <- 1:50   # try k from 1 to 15

for (k in k_vals) {
  km <- kmeans(pcs$x[,1:5], centers = k, nstart = 25)
  wcss[k] <- km$tot.withinss
}

elbow_df <- data.frame(
  k = k_vals,
  WCSS = wcss
)

library(ggplot2)
elbow_plot <- ggplot(elbow_df, aes(x = k, y = WCSS)) +
  geom_line() +
  geom_point(size = 2) +
  labs(
    title = "Elbow plot for choosing K",
    x = "Number of clusters (k)",
    y = "Total within-cluster sum of squares"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

elbow_plot

set.seed(1) # for luster labels to remain the same throughout runs
clusters <- as.factor(kmeans(pcs$x[,1:5], centers= 7)$cluster)

# see how many cells are in each cluster
table(clusters)

df <- data.frame(
  emb,
  pos, 
  clusters, 
  totgexp= totgexp,
  # gene=mat[,'Slc34a1'],
  pcs$x
)

# highlight cluster of interest in a different colour
cluster_colors <- c(
  "1" = "#D7191C",  # red (interest)
  "2" = "#EEEEEE",  # light grey
  "3" = "#EEEEEE",  # light grey
  "4" = "#EEEEEE",  # light grey (is everywhere)
  "5" = "#EEEEEE",  # light grey
  "6" = "#EEEEEE",  # light grey
  "7" = "#EEEEEE"  # light grey (everywhere)
)

################ VIEW CLUSTERS #######################

# use layering of the points in the cluster of interest and the rest
# highlight cluster of interest
p_spatial <- ggplot() +
  geom_point(
    data = subset(df, clusters != "1"),
    aes(x = x, y = y),
    color = "#D9D9D9",
    alpha = 0.05,
    size  = 0.6
  ) +
  geom_point(
    data = subset(df, clusters == "1"),
    aes(x = x, y = y),
    color = "#D7191C",
    alpha = 0.8,
    size  = 0.9
  ) +
  labs(
    title= "Cluster 1 in physical space"
  )+
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
p_spatial

# clusters visualized against PC1 and PC2
# compute label positions
centroids_pc <- aggregate(cbind(PC1, PC2) ~ clusters, data = df, FUN = median)

p_pc_1_2 <- ggplot(df, aes(x = PC1, y = PC2, col = clusters)) +
  geom_point(alpha = 0.3) +
  scale_color_manual(
    values = cluster_colors,
    labels = function(x) ifelse(x == "1", "1 (cluster of interest)", x)
  ) +
  # scale_color_manual(values = cluster_colors)+
  geom_text(
    data = centroids_pc,
    aes(x = PC1, y = PC2, label = clusters),
    inherit.aes = FALSE,
    color = "black",
    fontface = "bold",
    size = 4
  )+
  labs(
    title= "Cell clusters in PC space"
  )+
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# clusters visualized against tsne 
# compute label positions
centroids_tsne <- aggregate(cbind(X1, X2) ~ clusters, data = df, FUN = median)

p_tsne <- ggplot(df, aes(x = X1, y = X2, col = clusters)) +
  geom_point(alpha = 0.3) +
  scale_color_manual(
    values = cluster_colors,
    labels = function(x) ifelse(x == "1", "1 (cluster of interest)", x)
  ) +# scale_color_manual(values = cluster_colors)+
  geom_text(
    data = centroids_tsne,
    aes(x = X1, y = X2, label = clusters),
    inherit.aes = FALSE,
    color = "black",
    fontface = "bold",
    size = 4
  )+
  labs(
    title= "Cell clusters in tSNE space"
  )+
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
p_tsne
################ CLUSTER OF INTEREST #######################

# identify cluster of interest as 1
clusterofinterest_a <- names(clusters)[clusters == 1]
clusterofinterest_b <- names(clusters)[clusters != 1]

# identify differentially expressed genes
out <- sapply(colnames(mat), function(gene){
  x1 <- mat[clusterofinterest_a,gene]
  x2 <- mat[clusterofinterest_b,gene]
  wilcox.test(x1, x2, alternative = "greater")$p.value
})

# the lower the P value, the greater the relative upregulation in x1 compared to x2
sort(out, decreasing = FALSE)[1:30]

# view the expression of a few of the top 20 differentially expressed genes within these clusters
df <- data.frame(
  gene1= mat[,'Col5a2'],
  gene2= mat[,'Egf'],
  gene3= mat[,'Slc12a1'],
  gene4= mat[,'Slc5a3'],
  gene5= mat[,'Tgfbr3'],
  gene6= mat[,'Umod'],
  pcs$x,
  pos,
  clusters,
  emb
)

library(ggplot2)

# pick genes to show 
genes <- c("Col5a2","Egf","Slc12a1","Slc5a3","Tgfbr3","Umod")

# matrix of genes of interest and their expression in cells
expr_mat <- mat[, genes, drop = FALSE]

violin_df <- data.frame(
  cluster = rep(as.factor(clusters), times = length(genes)),
  gene    = rep(genes, each = nrow(expr_mat)),
  expr    = as.vector(as.matrix(expr_mat))
)

# optional: order clusters 1..K nicely
violin_df$cluster <- factor(violin_df$cluster, levels = sort(unique(as.integer(as.character(violin_df$cluster)))))

p_violin <- ggplot(violin_df, aes(x = cluster, y = expr, fill=cluster)) +
  geom_violin(scale = "width", trim = TRUE, linewidth = 0.25) +
  scale_fill_manual(
    values = cluster_colors,
    labels = function(x) ifelse(x == "1", "1 (cluster of interest)", x)
  ) +# scale_color_manual(values = cluster_colors)+
  coord_flip() +                              # makes violins horizontal (like your screenshot)
  facet_grid(. ~ gene, scales = "free_x") +    # one column per gene (thin panels)
  labs(
    title= "Differentially expressed genes in cluster 1"
  )+
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.title = element_blank(),
    strip.background = element_blank(),
    # strip.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.margin = margin(5.5, 5.5, 5.5, 5.5),
    plot.title = element_text(hjust = 0.5),
    strip.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5),
  )
p_violin

# view the genes in PC space to see whether the gene expression is constant across cells in the cluster
p1 <- ggplot(df, aes(x= PC1, y= PC2, col= gene1))+ geom_point(alpha=0.6)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "darkblue"
  )
p2 <- ggplot(df, aes(x= PC1, y= PC2, col= gene2))+ geom_point(alpha=0.6)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "darkblue"
  )
p3 <- ggplot(df, aes(x= PC1, y= PC2, col= gene3))+ geom_point(alpha=0.6)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "darkblue"
  )+
  labs(
    title= "Slc12a1 gene expression in PC space",
    col="gene expression"
  )+
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
p4 <- ggplot(df, aes(x= PC1, y= PC2, col= gene4))+ geom_point(alpha=0.6)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "darkblue"
  )
p5 <- ggplot(df, aes(x= PC1, y= PC2, col= gene5))+ geom_point(alpha=0.6)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "darkblue"
  )
p6 <- ggplot(df, aes(x= PC1, y= PC2, col= gene6))+ geom_point(alpha=0.6)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "darkblue"
  )

p7 <- ggplot(df, aes(x= X1, y= X2, col= gene3))+ geom_point(alpha=0.3)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "navyblue"
  )+
  labs(
    title= "Slc12a1 gene expression in tSNE space",
    col="gene expression"
  )+
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# chosen gene marker- Slc12a1
p2_spatial <- ggplot(df, aes(x=x, y=y, col=gene3))+
  geom_point(alpha=0.3)+
  scale_colour_gradient(
    low = "#EEEEEE",
    high = "navyblue"
  )+
  labs(
    title= "Slc12a1 gene expression in physical space",
    col="gene expression"
  )+
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

(elbow_plot|p_tsne|p_violin)/(p_spatial|p7|p2_spatial)+
  plot_annotation(tag_levels = "A")

6. Resources

I used R documentation and the ? help function on R itself to understand functions.

Park et al, “Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease,” Science 2018; https://www.science.org/doi/10.1126/science.aar2131

I used ChatGpt to help me add cluster labels on the clusters itself for ease of distinction. The following is an example prompt. “Add cluster labels on the clusters in the plot itself in addition to the legend.”

17 Feb 2026

« HW4 Visium Dataset: Finding PT epithelial cells »