Llucero3

In HW3, I identified a transcriptionally distinct cell type using an unsupervised workflow consisting of library-size normalization (CP10K + log1p), highly variable gene selection, PCA, t-SNE embedding, and k-means clustering. I selected a cluster of interest based on its separation in t-SNE space and then ranked genes by log fold change between that cluster and all other observations to identify marker genes. The HW3 multi-panel visualization showed that this cluster formed a coherent region in embedding space, was spatially localized in tissue coordinates, and was enriched for specific marker genes, supporting the interpretation that it represents a biologically meaningful cell population.
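The HW3 workflow summarized above can be sketched in base R on simulated counts. This is a minimal illustration under stated assumptions, not the original HW3 code: the count matrix, the planted population, the number of variable genes (20), the number of PCs (10), and k = 2 are all hypothetical choices, and the t-SNE step is omitted so the sketch needs no extra packages.

```r
# Simulated counts (cells x genes); hypothetical data, not the HW3 dataset.
set.seed(1)
n_cells <- 200
n_genes <- 50
counts <- matrix(rpois(n_cells * n_genes, lambda = 10), nrow = n_cells,
                 dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
# Plant a distinct population: the first 40 cells over-express genes 1-5.
counts[1:40, 1:5] <- counts[1:40, 1:5] + rpois(40 * 5, lambda = 30)

# CP10K + log1p: scale each cell to 10,000 total counts, then log-transform.
lib_size <- rowSums(counts)
norm <- log1p(counts / lib_size * 1e4)

# Highly variable genes: keep the 20 genes with the largest variance.
gene_var <- apply(norm, 2, var)
hvg <- names(sort(gene_var, decreasing = TRUE))[1:20]

# PCA on the scaled HVG matrix, then k-means on the leading PCs.
pcs <- prcomp(norm[, hvg], center = TRUE, scale. = TRUE)$x[, 1:10]
km <- kmeans(pcs, centers = 2, nstart = 10)

# Cluster of interest: here, whichever cluster contains planted cell 1.
in_cluster <- km$cluster == km$cluster[1]

# Rank genes by log fold change, approximated as the difference in mean
# log-normalized expression between the cluster and all other cells.
lfc <- colMeans(norm[in_cluster, , drop = FALSE]) -
  colMeans(norm[!in_cluster, , drop = FALSE])
top_markers <- names(sort(lfc, decreasing = TRUE))[1:5]
print(top_markers)
```

In real data the cluster of interest would be chosen by inspecting the embedding rather than by a known planted cell, and the cutoffs above would need tuning.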

For HW4, I adapted this approach to identify the same biological cell type in the other dataset (Visium ↔ Xenium). Because clustering structure and cluster IDs are not guaranteed to align across platforms due to differences in resolution (spot-level vs. single-cell) and gene detection sensitivity, I modified the labeling strategy. Rather than assuming a specific cluster number corresponded to the same identity, I defined the cell type using the same canonical marker panel (Aqp2, Avpr2, Slc14a2, Scnn1g) and computed a marker module score for each observation as the mean log1p expression of these genes. I then labeled the putative cell type as the top 10% of observations by module score, ensuring a consistent marker-based definition across datasets.

The resulting multi-panel figure provides convergent evidence that the same cell type was recovered. The spatial module score map shows a localized region of elevated marker expression rather than diffuse signal. The discrete cell-type assignment overlaps this same spatial region, indicating that thresholding isolates a coherent subset. Importantly, the individual marker gene maps demonstrate coordinated co-expression of all four canonical markers within the same region, indicating that the signal reflects a shared transcriptional program rather than a single-gene artifact. Statistical testing (Wilcoxon rank-sum test and t-test) shows that the module score is significantly higher in the labeled group compared to all other observations (p-values ≪ 0.05). Because the label is itself defined by thresholding the module score, this difference is expected by construction, so the spatial coherence and multi-gene co-expression are the more independent lines of evidence.
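One hedged way to quantify the co-expression claim is a leave-one-marker-out check: score each observation on three of the markers, label the top 10%, and then test whether the held-out fourth marker is higher in the labeled group. The sketch below uses simulated log1p expression, and the names `is_pos`, `expr`, and `loo_p` are hypothetical; it illustrates the idea and is not part of the submitted analysis.

```r
set.seed(2)
markers <- c("Aqp2", "Avpr2", "Slc14a2", "Scnn1g")
n <- 500

# Simulated data (assumption): the first 50 observations co-express all markers.
is_pos <- seq_len(n) <= 50
expr <- sapply(markers, function(g) {
  log1p(rpois(n, lambda = ifelse(is_pos, 8, 1)))
})

# For each marker, build the score from the other three, label the top 10%,
# and run a one-sided Wilcoxon test on the held-out marker.
loo_p <- sapply(markers, function(held_out) {
  others <- setdiff(markers, held_out)
  score <- rowMeans(expr[, others, drop = FALSE])
  labeled <- score >= quantile(score, 0.90)
  wilcox.test(expr[labeled, held_out], expr[!labeled, held_out],
              alternative = "greater")$p.value
})
print(signif(loo_p, 3))
```

Since the held-out gene plays no part in defining the label, a small p-value here is not guaranteed by construction the way the module-score test is.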

Compared to HW3, the key modification was replacing a fixed cluster ID with a marker-guided module scoring approach, since clustering boundaries can shift between technologies. By defining the cell type through shared canonical markers and demonstrating spatial coherence, multi-gene co-expression, and statistical enrichment in both datasets, I provide strong evidence that the same biological cell type was identified across platforms.

AI Disclosure: AI was used to debug the code below to help it run.

5. Code

library(data.table)
library(ggplot2)
library(patchwork)

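file_path <- file.choose()  # pick the .csv.gz

# Load the table; fread() reads .csv.gz directly (needs the R.utils package).
# Assumption: genes are stored as columns and the spatial coordinates are in
# columns named x and y; rename here if the file uses different names.
df <- fread(file_path)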

markers <- c("Aqp2", "Avpr2", "Slc14a2", "Scnn1g")
present_markers <- intersect(markers, names(df))
print(present_markers)

if (length(present_markers) < 2) stop("Markers missing after df load/rename.")

df[, x := as.numeric(x)]
df[, y := as.numeric(y)]

for (g in present_markers) {
  df[[g]] <- as.numeric(df[[g]])
  df[[g]] <- log1p(df[[g]])
}

df$score <- rowMeans(as.matrix(df[, ..present_markers]), na.rm = TRUE)

# Label top 10% as cell-type
thr <- quantile(df$score, 0.90, na.rm = TRUE)
df$label <- ifelse(df$score >= thr, "Cell-type", "Other")

# Stats (required)
wilcox_p <- wilcox.test(df$score[df$label=="Cell-type"], df$score[df$label=="Other"])$p.value
ttest_p  <- t.test(df$score[df$label=="Cell-type"], df$score[df$label=="Other"])$p.value

# Plot settings (small + transparent)
pt_size <- 0.35
pt_alpha <- 0.45

p_score <- ggplot(df, aes(x, y, color = score)) +
  geom_point(size = pt_size, alpha = pt_alpha) +
  scale_y_reverse() +
  coord_fixed() +
  theme_minimal() +
  labs(title = "Marker module score (spatial)",
       subtitle = paste0("Wilcox p=", signif(wilcox_p, 3),
                         " | t-test p=", signif(ttest_p, 3)),
       color = "Score")

p_label <- ggplot(df, aes(x, y, color = label)) +
  geom_point(size = pt_size, alpha = pt_alpha) +
  scale_y_reverse() +
  coord_fixed() +
  theme_minimal() +
  labs(title = "Cell-type assignment (spatial)") +
  theme(legend.title = element_blank())

marker_plots <- lapply(present_markers, function(g) {
  ggplot(df, aes(x, y, color = .data[[g]])) +
    geom_point(size = pt_size, alpha = pt_alpha) +
    scale_y_reverse() +
    coord_fixed() +
    theme_minimal() +
    labs(title = g, color = "log1p(expr)")
})

final_plot <- (p_score | p_label) / wrap_plots(marker_plots, ncol = 2)
final_plot

ggsave("marker_panels_spatial.png", final_plot, width = 10, height = 10, dpi = 300)

17 Feb 2026
