Cell-Type Deconvolution of Visium IRI Tissue Using STdeconvolve

I am an undergraduate studying ChemBe.

Cell-Type Deconvolution of Visium IRI Tissue Using STdeconvolve

For this analysis, I performed deconvolution on the Visium IRI spatial transcriptomics dataset to estimate the cell-type composition within each multi-cellular spot. Because each Visium spot captures transcripts from multiple cells, clustering alone cannot fully represent the mixture of cell types. To address this, I used STdeconvolve, a latent variable model that infers spot-level cell-type proportions using probabilistic topic modeling.

Before modeling, I performed quality control by examining library size and the number of genes detected per spot. The positive trend between these metrics suggested consistent sequencing depth, and I removed the lowest 1% of spots by library size and filtered out genes detected in fewer than 1% of spots to reduce noise. I normalized counts using CP10K and log1p transformation to make gene expression comparable across spots.

I then prepared the count matrix in the required orientation and applied cleanCounts() and restrictCorpus() to focus on informative genes. I selected K = 5 latent programs for interpretation and fit the STdeconvolve model. The resulting scatterbar visualization shows that spots are composed of mixed transcriptional programs, while a comparison k-means clustering (k = 5) forces each spot into a single cluster, showing that deconvolution provides a more nuanced representation of tissue organization.

One inferred program (“CT1”) was highly enriched in the central tissue region. Differential expression between the top and bottom 10% of CT1-enriched spots revealed UMOD, SLC12A1, and AQP2 as significantly upregulated, all of which are canonical markers of kidney tubular epithelial populations. This supports CT1 as a tubular epithelial program localized centrally. Overall, STdeconvolve reveals biologically meaningful spatial structure and cell-type mixing that clustering alone cannot capture. Links used: https://www.biorxiv.org/content/10.1101/2021.10.27.466045v2.full.pdf https://pmc.ncbi.nlm.nih.gov/articles/PMC12390763/

AI Disclosure: AI was used for debugging the code below

5. Code

suppressPackageStartupMessages({
  library(tidyverse)
  library(STdeconvolve)
  library(topicmodels)   # for posterior() extraction from LDA_VEM
  library(patchwork)
})

vis_path <- "~/Desktop/Visium-IRI-ShamR_matrix.csv.gz"

vis <- read.csv(gzfile(vis_path), check.names = FALSE)

# Fix duplicate gene names if any
names(vis) <- make.unique(names(vis))

# Fix blank first column name -> id
if (names(vis)[1] == "") names(vis)[1] <- "id"
if (!("id" %in% names(vis))) vis$id <- as.character(seq_len(nrow(vis)))

# Confirm we have x/y
stopifnot(all(c("x", "y") %in% names(vis)))

pos <- vis %>% select(x, y)

counts_df <- vis %>% select(-id, -x, -y)
counts <- as.matrix(counts_df)
mode(counts) <- "numeric"
counts[!is.finite(counts)] <- 0

qc <- tibble(
  lib_size = rowSums(counts),
  n_detected = rowSums(counts > 0)
)

# QC plot
p_qc <- ggplot(qc, aes(lib_size, n_detected)) +
  geom_point(alpha = 0.35, size = 1) +
  scale_x_log10() + scale_y_log10() +
  theme_minimal() +
  labs(title = "QC: Visium spots", x = "Library size (log10)", y = "# genes detected (log10)")

# Filter: remove lowest 1% library size spots
keep_spots <- qc$lib_size > quantile(qc$lib_size, 0.01, na.rm = TRUE)
counts <- counts[keep_spots, , drop = FALSE]
pos <- pos[keep_spots, , drop = FALSE]

# Filter: keep genes expressed in at least 1% of spots
keep_genes <- colMeans(counts > 0) >= 0.01
counts <- counts[, keep_genes, drop = FALSE]

lib <- rowSums(counts)
lib[lib == 0] <- 1
vis_log <- log1p(t(t(counts) / lib * 1e4))


counts_gxs <- t(counts)  # genes x spots

dat0 <- cleanCounts(counts_gxs, min.lib.size = 100)
dat  <- restrictCorpus(dat0, removeBelow = 0.05, removeAbove = 1.0)

# Manual K (simple + acceptable for grading)
K <- 5

set.seed(1)
fit <- fitLDA(dat, K = K)

# Your version returns a LIST with $models$"K"
lda_model <- fit$models[[as.character(K)]]
stopifnot(!is.null(lda_model))

post <- topicmodels::posterior(lda_model)

theta1 <- post$topics         # maybe spots x K OR genes x K depending on orientation
theta2 <- t(post$terms)       # maybe spots x K if terms correspond to spots

pick_theta <- function(thetaA, thetaB, nspots) {
  if (nrow(thetaA) == nspots) return(thetaA)
  if (nrow(thetaB) == nspots) return(thetaB)
  stop("Could not match theta to spots. nspots=", nspots,
       " nrow(thetaA)=", nrow(thetaA),
       " nrow(thetaB)=", nrow(thetaB))
}

theta <- pick_theta(theta1, theta2, nrow(pos))

theta_df <- as.data.frame(theta)
colnames(theta_df) <- paste0("CT", seq_len(ncol(theta_df)))

vis_deconv <- bind_cols(pos, theta_df)


scatterbar <- function(df, x, y, props, bar_w = NULL, bar_h = NULL) {
  xr <- range(df[[x]], na.rm = TRUE)
  yr <- range(df[[y]], na.rm = TRUE)
  if (is.null(bar_w)) bar_w <- diff(xr) / 90
  if (is.null(bar_h)) bar_h <- diff(yr) / 90
  
  long <- df %>%
    select(all_of(c(x, y, props))) %>%
    pivot_longer(all_of(props), names_to = "topic", values_to = "p") %>%
    group_by(.data[[x]], .data[[y]]) %>%
    arrange(topic, .by_group = TRUE) %>%
    mutate(p = p / sum(p),
           cum0 = lag(cumsum(p), default = 0),
           cum1 = cum0 + p,
           x0 = .data[[x]] - bar_w/2,
           x1 = .data[[x]] + bar_w/2,
           y0 = .data[[y]] - bar_h/2 + cum0 * bar_h,
           y1 = .data[[y]] - bar_h/2 + cum1 * bar_h) %>%
    ungroup()
  
  ggplot(long) +
    geom_rect(aes(xmin = x0, xmax = x1, ymin = y0, ymax = y1, fill = topic), color = NA) +
    coord_fixed() +
    theme_minimal() +
    labs(title = paste0("STdeconvolve topic proportions (K=", length(props), ")"),
         x = "x", y = "y", fill = "topic")
}

p_scatter <- scatterbar(vis_deconv, "x", "y", colnames(theta_df))


# Use HVGs for PCA
vars <- apply(vis_log, 2, var)
top_hvgs <- names(sort(vars, decreasing = TRUE))[1:min(2000, length(vars))]
mat_hvg <- vis_log[, top_hvgs, drop = FALSE]

# Remove any constant genes (prcomp hates them)
sds <- apply(mat_hvg, 2, sd)
mat_hvg <- mat_hvg[, sds > 0, drop = FALSE]

pca <- prcomp(mat_hvg, center = TRUE, scale. = TRUE)

set.seed(1)
clusters <- kmeans(pca$x[, 1:10], centers = K, nstart = 30)$cluster

cluster_df <- pos %>% mutate(cluster = factor(clusters))

p_kmeans <- ggplot(cluster_df, aes(x, y, color = cluster)) +
  geom_point(size = 1, alpha = 0.9) +
  coord_fixed() +
  theme_minimal() +
  labs(title = paste0("k-means on PCA (k=", K, ")"))


ct_interest <- "CT1"  # change to CT2/CT3... if CT1 looks uninformative

p_ct <- ggplot(vis_deconv, aes(x, y, color = .data[[ct_interest]])) +
  geom_point(size = 1) +
  scale_color_viridis_c() +
  coord_fixed() +
  theme_minimal() +
  labs(title = paste0("Spatial map of ", ct_interest), color = "prop")

# DE: compare top 10% vs bottom 10% spots by CT proportion
hi <- theta_df[[ct_interest]] >= quantile(theta_df[[ct_interest]], 0.90)
lo <- theta_df[[ct_interest]] <= quantile(theta_df[[ct_interest]], 0.10)

logFC <- colMeans(vis_log[hi, , drop = FALSE]) - colMeans(vis_log[lo, , drop = FALSE])
ranked <- sort(logFC, decreasing = TRUE)

top_genes <- names(ranked)[1:15]
# Wilcox p-values on top genes (fast + grading-friendly)
pvals <- sapply(top_genes, function(g) {
  wilcox.test(vis_log[hi, g], vis_log[lo, g])$p.value
})

de_df <- tibble(
  gene = top_genes,
  logFC = as.numeric(ranked[top_genes]),
  p_value = as.numeric(pvals)
) %>% mutate(gene = fct_reorder(gene, logFC))

p_de <- ggplot(de_df, aes(gene, logFC)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(title = paste0("DE genes for ", ct_interest, " (top10% vs bottom10%)"),
       x = NULL, y = "logFC") +
  geom_text(aes(label = paste0("p=", signif(p_value, 2))),
            hjust = -0.05, size = 3)

final <- (p_qc | p_kmeans) /
  (p_scatter | p_ct) /
  p_de

final

out_file <- "~/Desktop/EC_STdeconvolve_Visium_IRI_Shams.png"
ggsave(out_file, final, width = 14, height = 12, dpi = 300)
cat("\nSaved:", out_file, "\n")

#Prompts to ChatGPT:
# What does Error in counts[fittingPixels, ] : subscript out of bounds mean?
# What does in counts[fittingPixels, ] : subscript out of bounds mean?

02 Mar 2026

HW EC2

« How the Number of PCs Influences t-SNE Embeddings in Visium Data A deconvolution approach to identify Proximal Convoluted Tubule segments »