Identification of TAL cells using Deconvolution

I'm a senior undergraduate BME student. I like to travel, try new foods, and bowl with my friends.

Identification of TAL cells using Deconvolution

Compare your result with the clustering and differential expression analysis you did previously in HW3/4. Explain how your results are similar or different. Create a data visualization comparing all three analyses.

This visualization is a comparison of three approaches to identifying Thick Ascending Limb (TAL) cells in kidney tissue. It includes STdeconvolve deconvolution on Visium (this homework), K-means clustering on Visium (HW3), and K-means clustering on Xenium (HW4). All of the methods use K = 4, which was selected by STdeconvolve, so that there is a fair comparison. In HW3 I originally used k = 6 and in HW4 I used k = 8, but here it is standardized to K=4 for consistency.

In the first row of the visualization, spatial organization is compared for all three analysis methods. Panel A shows STdeconvolve results, which was visualized using vizAllTopics becuase scatterbar was not recognizing column names. Each Visium spot is a stacked bar showing the estimated proportions of K = 4 deconvolved cell-types. Unlike K-means, which assigns each spot a single identity, deconvolution shows when there is mixed cell-type composition within spots. Panels B and C show the TAL cluster in red, which was identified by K-means in both the Visium and Xenium datasets, with all other cells in grey. The TAL cells were localized to the medulla in all three analyses.

The second row shows the range of differential expression along with the specific TAL marker Slc12a1. Panel D is a volcano plot of genes differentially expressed between Visium spots with high vs low TAL proportion (from deconvolution). The known TAL markers (Slc12a1, Umod, Egf, Cd24a) were labeled. Panels E and F show Slc12a1 expression in both the Visium and Xenium tissues,reiterating the spatial patterns across different platforms.

The last row shows a differentially upregulated gene (Umod) that was identified from the deconvolution analysis. It was also visualized in both Visium (G) and Xenium (H), showing the DE finding at single-cell resolution.

All three approaches were able to identify TAL cells in the same anatomical region of the medulla. The same marker genes (Slc12a1, Umod, Egf) emerged as top upregulated genes, regardless of the method used. The main difference was that deconvolution was able to capture continuous variation in TAL proportion at each spot. This showed the existence of gradients at tissue boundaries that isn’t completely represented by K-means clustering.

Code

library(ggplot2)
library(patchwork)

## read in Visium data
setwd("/Users/emmameihofer/Documents/GitHub/genomic-data-visualization-2026")
data <- read.csv("data/Visium-IRI-ShamR_matrix.csv.gz")
pos <- data[,c('x', 'y')]
rownames(pos) <- data[,1]
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[,1]
dim(gexp)

# read in Xenium data
xdata <- read.csv("data/Xenium-IRI-ShamR_matrix.csv.gz")
set.seed(42)
xdata <- xdata[sample(1:nrow(xdata), 10000),]
xpos <- xdata[,c('x', 'y')]
rownames(xpos) <- xdata[,1]
xgexp <- xdata[, 4:ncol(xdata)]
rownames(xgexp) <- xdata[,1]

# Shared genes
genes.have <- intersect(colnames(xgexp), colnames(gexp))

# STdeconvolve ON VISIUM
library(STdeconvolve)

gexp.sub <- gexp[, genes.have]
ldas <- fitLDA(gexp.sub, Ks = c(3, 4, 5, 6, 7))
optLDA <- optimalModel(models = ldas, opt = "kneed")
results <- getBetaTheta(optLDA, perc.filt = 0.05, betaScale = 1000)

deconProp <- results$theta
deconGexp <- results$beta
K <- ncol(deconProp)

# Print top genes per cell-type
for (i in 1:K) {
  cat("Cell-type", i, ":", names(head(sort(deconGexp[i,], decreasing = TRUE), 5)), "\n")
}

# Identify TAL cell-type (highest Slc12a1 loading)
ct_scores <- deconGexp[, 'Slc12a1']
ct <- names(which.max(ct_scores))
cat("TAL cell-type:", ct, "\n")

# VISIUM K-MEANS (same K as deconvolution for comparison)
totgexp <- rowSums(gexp)
mat <- log10(gexp / totgexp * 1e6 + 1)
vg <- apply(mat, 2, var)
vargenes <- names(sort(vg, decreasing = TRUE)[1:500])
matsub <- mat[, vargenes]

pcs <- prcomp(matsub, center = TRUE, scale. = FALSE)

set.seed(42)
km_vis <- kmeans(pcs$x[, 1:15], centers = K)
vis_clusters <- as.factor(km_vis$cluster)

# Find Visium cluster most enriched for Slc12a1
vis_cl_means <- tapply(mat[, 'Slc12a1'], vis_clusters, mean)
vis_cl_interest <- names(which.max(vis_cl_means))
cat("Visium TAL cluster:", vis_cl_interest, "\n")

# XENIUM K-MEANS (same K as deconvolution for comparison)
xtotgexp <- rowSums(xgexp)
xmat <- log10(xgexp / xtotgexp * 1e6 + 1)
xpcs <- prcomp(xmat, center = TRUE, scale. = FALSE)

set.seed(42)
km_xen <- kmeans(xpcs$x[, 1:15], centers = K, nstart = 10)
xen_clusters <- as.factor(km_xen$cluster)

# Find Xenium cluster most enriched for Slc12a1
xen_cl_means <- tapply(xmat[, 'Slc12a1'], xen_clusters, mean)
xen_cl_interest <- names(which.max(xen_cl_means))
cat("Xenium TAL cluster:", xen_cl_interest, "\n")

# DIFFERENTIAL EXPRESSION (deconvolution-based on Visium)
mat_sub_norm <- log10(gexp.sub / rowSums(gexp.sub) * 1e6 + 1)

ct_prop <- deconProp[, ct]
high_spots <- names(ct_prop[ct_prop > median(ct_prop)])
low_spots <- names(ct_prop[ct_prop <= median(ct_prop)])

# Two-sided test
de_pvals_up <- sapply(colnames(mat_sub_norm), function(g) {
  wilcox.test(mat_sub_norm[high_spots, g], mat_sub_norm[low_spots, g],
              alternative = 'greater')$p.value
})
de_pvals_down <- sapply(colnames(mat_sub_norm), function(g) {
  wilcox.test(mat_sub_norm[high_spots, g], mat_sub_norm[low_spots, g],
              alternative = 'less')$p.value
})
de_pvals_twosided <- sapply(colnames(mat_sub_norm), function(g) {
  wilcox.test(mat_sub_norm[high_spots, g], mat_sub_norm[low_spots, g],
              alternative = 'two.sided')$p.value
})

mean_high <- colMeans(mat_sub_norm[high_spots, ])
mean_low <- colMeans(mat_sub_norm[low_spots, ])
log2fc <- mean_high - mean_low

de_results <- data.frame(
  gene = names(de_pvals_twosided),
  pval = de_pvals_twosided,
  pval_up = de_pvals_up,
  pval_down = de_pvals_down,
  log2fc = log2fc,
  neglog10p = -log10(de_pvals_twosided)
)

# Cap Inf values (from p-values of exactly 0)
max_finite <- max(de_results$neglog10p[is.finite(de_results$neglog10p)])
de_results$neglog10p[is.infinite(de_results$neglog10p)] <- max_finite * 1.05

de_results <- de_results[order(de_results$pval), ]

cat("\nTop 15 DE genes (deconvolution-based):\n")
print(head(de_results[order(de_results$pval_up), ], 15))

# Pick DE gene: top significant upregulated gene that isn't Slc12a1
de_up_sorted <- de_results[order(de_results$pval_up), ]
de_up_sig <- de_up_sorted[de_up_sorted$pval_up < 0.05 & de_up_sorted$log2fc > 0, ]
de_gene <- de_up_sig$gene[de_up_sig$gene != 'Slc12a1'][1]
cat("DE gene for visualization:", de_gene, "\n")

# BUILD FIGURE
library(ggplot2)
library(patchwork)
library(ggrepel)

theme_hw <- theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", size = 11),
        legend.position = "bottom",
        plot.margin = margin(5, 10, 5, 5))

# Used vizAllTopics because scatterbar had a functional error
# Panel A: STdeconvolve proportions via vizAllTopics
pos_decon <- data.frame(x = pos[rownames(deconProp), 'x'],
                        y = pos[rownames(deconProp), 'y'])
rownames(pos_decon) <- rownames(deconProp)
pA <- vizAllTopics(deconProp, pos_decon, r = 40) +
  labs(title = paste0("A. Deconvolution (K=", K, ")")) +
  theme(plot.title = element_text(face = "bold", size = 11))

# Panel B: Visium K-means TAL cluster on tissue
df_vis <- data.frame(pos, cluster = vis_clusters)
df_vis_other <- df_vis[df_vis$cluster != vis_cl_interest, ]
df_vis_coi <- df_vis[df_vis$cluster == vis_cl_interest, ]

pB <- ggplot() +
  geom_point(data = df_vis_other, aes(x = x, y = y),
             color = "#D3D3D3", size = 1.5, alpha = 0.3) +
  geom_point(data = df_vis_coi, aes(x = x, y = y),
             color = "#E63946", size = 1.5, alpha = 0.6) +
  coord_fixed() +
  labs(title = paste0("B. Visium K-means TAL (K=", K, ")")) +
  theme_hw + theme(legend.position = "none")

# Panel C: Xenium K-means TAL cluster on tissue
df_xen <- data.frame(xpos, cluster = xen_clusters)
df_xen_other <- df_xen[df_xen$cluster != xen_cl_interest, ]
df_xen_coi <- df_xen[df_xen$cluster == xen_cl_interest, ]

pC <- ggplot() +
  geom_point(data = df_xen_other, aes(x = x, y = y),
             color = "#D3D3D3", size = 0.3, alpha = 0.3) +
  geom_point(data = df_xen_coi, aes(x = x, y = y),
             color = "#E63946", size = 0.3, alpha = 0.6) +
  coord_fixed() +
  labs(title = paste0("C. Xenium K-means TAL (K=", K, ")")) +
  theme_hw + theme(legend.position = "none")

# LLM Prompt:  How do I plot the bidirectional volcano plot like the one
# I used in this figure (hw4) using the deconvolution visium data?
# Panel D: Bidirectional volcano plot
de_results$direction <- ifelse(de_results$pval < 0.05 & de_results$log2fc > 0.25, "Upregulated",
                        ifelse(de_results$pval < 0.05 & de_results$log2fc < -0.25, "Downregulated",
                        "Not Significant"))
de_results$direction <- factor(de_results$direction,
                               levels = c("Upregulated", "Downregulated", "Not Significant"))

top5_up <- head(de_results$gene[de_results$direction == "Upregulated"], 5)
top3_down <- head(de_results$gene[de_results$direction == "Downregulated"], 3)
tal_markers <- c('Slc12a1', 'Umod', 'Egf', 'Cd24a')
available_markers <- tal_markers[tal_markers %in% de_results$gene]
label_genes <- unique(c(top5_up, top3_down, available_markers, de_gene))
de_results$label <- ifelse(de_results$gene %in% label_genes, de_results$gene, "")

volcano_cols <- c("Upregulated" = "#E63946", "Downregulated" = "#457B9D",
                  "Not Significant" = "grey70")

pD <- ggplot(de_results, aes(x = log2fc, y = neglog10p)) +
  geom_point(aes(col = direction), size = 0.8, alpha = 0.6) +
  scale_color_manual(values = volcano_cols) +
  geom_label_repel(aes(label = label), size = 3, max.overlaps = 20,
                   box.padding = 0.5, point.padding = 0.3,
                   min.segment.length = 0, segment.color = "grey40",
                   fill = "white", label.size = 0.2) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = c(-0.25, 0.25), linetype = "dashed", color = "grey50") +
  scale_y_continuous(expand = expansion(mult = c(0.02, 0.15))) +
  labs(title = "D. Differentially expressed genes (deconvolution)",
       x = "Log2 fold change", y = "-log10(p-value)", color = "Direction") +
  theme_hw

# Panel E: Slc12a1 in Visium tissue
df_vis_gene <- data.frame(pos, gene = mat[, 'Slc12a1'])
df_vis_gene <- df_vis_gene[order(df_vis_gene$gene), ]
pE <- ggplot(df_vis_gene, aes(x = x, y = y, col = gene)) +
  geom_point(size = 1.5, alpha = 0.5) +
  scale_color_gradient2(low = "grey90", mid = "#FCBBA1", high = "#E63946",
                        midpoint = median(df_vis_gene$gene)) +
  coord_fixed() +
  labs(title = "E. Slc12a1 in Visium", color = "Expr") +
  theme_hw

# Panel F: Slc12a1 in Xenium tissue
df_xen_gene <- data.frame(xpos, gene = xmat[, 'Slc12a1'])
df_xen_gene <- df_xen_gene[order(df_xen_gene$gene), ]
pF <- ggplot(df_xen_gene, aes(x = x, y = y, col = gene)) +
  geom_point(size = 0.3, alpha = 0.5) +
  scale_color_gradient2(low = "grey90", mid = "#FCBBA1", high = "#E63946",
                        midpoint = median(df_xen_gene$gene)) +
  coord_fixed() +
  labs(title = "F. Slc12a1 in Xenium", color = "Expr") +
  theme_hw

# Panel G: DE gene in Visium tissue
df_vis_de <- data.frame(pos, gene = mat_sub_norm[, de_gene])
df_vis_de <- df_vis_de[order(df_vis_de$gene), ]
pG <- ggplot(df_vis_de, aes(x = x, y = y, col = gene)) +
  geom_point(size = 1.5, alpha = 0.5) +
  scale_color_gradient2(low = "grey90", mid = "#FCBBA1", high = "#E63946",
                        midpoint = median(df_vis_de$gene)) +
  coord_fixed() +
  labs(title = paste0("G. ", de_gene, " in Visium (DE)"), color = "Expr") +
  theme_hw

# Panel H: DE gene in Xenium tissue
df_xen_de <- data.frame(xpos, gene = xmat[, de_gene])
df_xen_de <- df_xen_de[order(df_xen_de$gene), ]
pH <- ggplot(df_xen_de, aes(x = x, y = y, col = gene)) +
  geom_point(size = 0.3, alpha = 0.5) +
  scale_color_gradient2(low = "grey90", mid = "#FCBBA1", high = "#E63946",
                        midpoint = median(df_xen_de$gene)) +
  coord_fixed() +
  labs(title = paste0("H. ", de_gene, " in Xenium (DE)"), color = "Expr") +
  theme_hw

# ASSEMBLE AND SAVE
top_row <- pA + pB + pC + plot_layout(nrow = 1)
mid_row <- pD + pE + pF + plot_layout(nrow = 1)
bot_row <- plot_spacer() + pG + pH + plot_layout(nrow = 1)

final <- (top_row / mid_row / bot_row) +
  plot_annotation(
    title = 'Deconvolution of Visium TAL Cells: Comparison with K-means (HW3/HW4)',
    subtitle = paste0('STdeconvolve K=', K, ' | TAL cell-type: ', ct,
                      ' | Marker: Slc12a1 | DE gene: ', de_gene),
    theme = theme(
      plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
      plot.subtitle = element_text(size = 11, hjust = 0.5, color = "grey40")
    )
  )

ggsave("~/Downloads/emeihof1_HW5.png", final,
       width = 18, height = 16, dpi = 300, bg = "white")
cat("\n=== Saved to ~/Downloads/emeihof1_HWEC2.png ===\n")

02 Mar 2026

HW EC2

« EC2-Integrated Spatial Analysis of Proximal Tubule Cell Types Using K-means Clustering and STdeconvolve (K = 5) Across Visium and Xenium Effect of Varying PC Count on tSNE Space - Visium »