HW5-Interpreting Spleen Tissue Structure from CODEX Spatial Proteomics Data

Hi everyone, I’m a second-year PhD student in Biomedical Engineering, and my research focuses on medical imaging. I’m very happy to meet you all and looking forward to learning together.

HW5-Interpreting Spleen Tissue Structure from CODEX Spatial Proteomics Data

1. Describe your figure briefly.

I performed a full CODEX analysis to infer which spleen tissue structure is represented. After loading the dataset, I separated spatial coordinates (x, y), cell area, and protein expression features, then applied a basic quality-control filter by removing low-signal protein channels (keeping markers with total signal > 1000 across cells). To make cells comparable despite differences in overall staining intensity, I normalized each cell’s protein profile by its total signal , rescaled by the global mean total signal, and applied a log10 transform. I then selected the clustering resolution using the elbow method based on within-cluster sum of squares (WSS) for k = 1–10 and chose k = 6 as a balance between over- and under-partitioning. Using k-means (k = 6), I visualized clusters in a low-dimensional embedding by running PCA on the normalized matrix and applying t-SNE to the first 20 PCs (perplexity = 30), and I also mapped the same clusters back to physical space to assess whether they form coherent spatial structures.

Among the six clusters, clusters 1 and 5 showed clear separation in t-SNE space and non-random spatial organization, so I focused on these two for detailed interpretation. To identify marker proteins defining each cluster, I performed cluster-vs-rest differential expression using Wilcoxon rank-sum tests for each marker and computed log2 fold-changes, summarizing results with volcano plots and highlighting key markers. Cluster 1 showed strong upregulation of CD8, consistent with a CD8 T cell–rich population[1], while cluster 5 showed strong upregulation of CD21 and elevated CD3e/CD45RO, consistent with a lymphocyte-rich immune compartment with prominent B cell–associated signal and T cell–associated markers[2]. To validate that these signatures correspond to tissue organization rather than purely statistical clusters, I plotted continuous marker expression for CD8 and CD21 in both t-SNE space and physical space, demonstrating spatially patterned enrichment, and I further compared CD8, CD21, CD3e, and CD45RO across all six clusters using violin plots (highlighting clusters 1 and 5). Taken together, the presence of spatially organized, lymphocyte-associated compartments—one enriched for CD8 T cells and another enriched for CD21 with additional T cell markers—most strongly supports the interpretation that the imaged structure corresponds to spleen white pulp (immune-rich regions with T- and B-cell–associated zones), rather than vascular, red pulp, or capsule/trabecular structures[3].

[1]Martin, Matthew D., and Vladimir P. Badovinac. “Defining memory CD8 T cell.” Frontiers in immunology 9 (2018): 416271. [2]Thorarinsdottir, Katrin, et al. “CD21–/low B cells in human blood are memory cells.” Clinical & Experimental Immunology 185.2 (2016): 252-262. [3]Nolte, Martijn A., et al. “Isolation of the intact white pulp. Quantitative and qualitative analysis of the cellular composition of the splenic compartments.” European journal of immunology 30.2 (2000): 626-634.

2. Code (paste your code in between the ``` symbols)

library(ggplot2)
library(Rtsne)
library(dplyr)
library(cluster)
library(reshape2)
library(ggrepel)

# Load dataset
codex_file <- '/Users/xl/Desktop/JHU2026Spring/Genomic-Data-Visualization/Homework/HW5/codex_spleen2.csv.gz'
codex_data <- read.csv(codex_file, row.names = 1)

# Separate spatial coordinates, cell area, and protein expressions
spatial_coords <- codex_data[, 1:2]
cell_area <- codex_data[, 3]
protein_data <- codex_data[, 4:ncol(codex_data)]

# Filter and normalize gene expression
protein_data_filter <- protein_data[, colSums(protein_data) > 1000]
protein_data_norm <- log10(protein_data_filter / rowSums(protein_data_filter) * mean(rowSums(protein_data_filter)) + 1)

# Determine optimal k
wss <- sapply(1:20, function(k) {
  kmeans(protein_data_norm, centers = k, nstart = 10, iter.max=50)$tot.withinss
})

optimal_k_plot <- ggplot(data.frame(k = 1:20, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  ggtitle("Determine optimal K")

# We use k=6 
optimal_k_plot

# Clustering using k-means
set.seed(123)
k <- 6
kmeans_result <- kmeans(protein_data_norm, centers = 6)
clusters <- as.character(kmeans_result$cluster)
table(clusters)   

pcs <- prcomp(protein_data_norm, center = TRUE, scale. = FALSE)
set.seed(123)
tsne_result <- Rtsne(pcs$x[, 1:20], perplexity = 30)

tsne_df <- data.frame(
  tSNE1 = tsne_result$Y[, 1],
  tSNE2 = tsne_result$Y[, 2],
  cluster = clusters
)


pos_df <- data.frame(
  x = spatial_coords$x,
  y = spatial_coords$y,
  cluster = clusters
)


# ---------------------------------------------------------
# 6) Explore ALL clusters first (before choosing focus ones)
# ---------------------------------------------------------
p_all_tsne <- ggplot(tsne_df, aes(tSNE1, tSNE2, color = cluster)) +
  geom_point(alpha = 0.6, size = 0.5) +
  theme_bw() +
  ggtitle(paste0("All clusters (k=", k, ") in tSNE space"))

p_all_space <- ggplot(pos_df, aes(x, y, color = cluster)) +
  geom_point(alpha = 0.8, size = 0.5) +
  coord_fixed() +
  theme_bw() +
  ggtitle(paste0("All clusters (k=", k, ") in physical space"))

(p_all_tsne | p_all_space)

focus_clusters <- c("1", "5")
make_highlight <- function(clusters, keep) {
  clusters_chr <- as.character(clusters)
  hl <- ifelse(clusters_chr %in% keep, clusters_chr, "Other")
  factor(hl, levels = c("Other", keep))
}

clusters_hl <- make_highlight(clusters, focus_clusters)

pos_df <- data.frame(spatial_coords, clusters_hl)
colnames(pos_df)[1:2] <- c("x", "y")

tsne_df$clusters_hl <- clusters_hl
pos_df$clusters_hl <- clusters_hl
# colors: no red/green
hl_colors <- c(
  "Other" = "grey92",
  "1" = "#3F007D", 
  "5" = "#D55E00" 
)


p1 <- ggplot(tsne_df, aes(x = tSNE1, y = tSNE2, color = clusters_hl)) +
  geom_point(alpha = 0.6, size = 0.5) +
  scale_color_manual(values = hl_colors) +
  ggtitle("Highlighted clusters in tSNE space") +
  theme_bw()
p1
p2 <- ggplot(pos_df, aes(x = x, y = y, color = clusters_hl)) +
  geom_point(alpha = 0.8, size = 0.5) +
  scale_color_manual(values = hl_colors) +
  ggtitle("Highlighted clusters in physical space") +
  theme_bw()



run_de <- function(target_cluster, protein_mat, clusters_raw) {
  pv <- sapply(colnames(protein_mat), function(g) {
    wilcox.test(protein_mat[clusters_raw == target_cluster, g],
                protein_mat[clusters_raw != target_cluster, g])$p.value
  })
  logfc <- sapply(colnames(protein_mat), function(g) {
    log2(mean(protein_mat[clusters_raw == target_cluster, g]) /
           mean(protein_mat[clusters_raw != target_cluster, g]))
  })
  
  df <- data.frame(gene = colnames(protein_mat),
                   logfc = logfc,
                   logpv = -log10(pv + 1e-300))
  
  df$diffexp <- "Not Significant"
  df[df$logpv > 2 & df$logfc > 0.6, "diffexp"] <- "Upregulated"
  df[df$logpv > 2 & df$logfc < -0.6, "diffexp"] <- "Downregulated"
  df$diffexp <- factor(df$diffexp, levels = c("Not Significant", "Downregulated", "Upregulated"))
  df
}

clusters_raw <- as.character(kmeans_result$cluster)
tsne_df$hl <- ifelse(clusters_raw %in% c("1","5"), clusters_raw, NA)
pos_df$hl  <- ifelse(clusters_raw %in% c("1","5"), clusters_raw, NA)


df_de_1 <- run_de("1", protein_data_norm, clusters_raw)
df_de_5 <- run_de("5", protein_data_norm, clusters_raw)
get_top_up <- function(df_de, top_n = 10) {
  df_de %>%
    filter(diffexp == "Upregulated") %>%
    arrange(desc(logpv)) %>% 
    head(top_n)
}
top_up_1 <- get_top_up(df_de_1, top_n = 10)
top_up_5 <- get_top_up(df_de_5, top_n = 10)
top_up_1
top_up_5


# Genes to label on each volcano plot
label_genes_p3 <- c("CD8")                           # Cluster 1: label CD8 only
label_genes_p4 <- c("CD21", "CD3e", "CD45RO") # Cluster 5: label these three

# Volcano plot helper: recolor points and circle + label selected markers
volcano_custom <- function(df_de, title, label_genes) {
  df_lab <- df_de[df_de$gene %in% label_genes, ]
  
  ggplot(df_de, aes(x = logfc, y = logpv, color = diffexp)) +
    geom_point(size = 2, alpha = 0.9) +
    # Circle selected markers (black outline)
    geom_point(
      data = df_lab,
      aes(x = logfc, y = logpv),
      inherit.aes = FALSE,
      shape = 21, stroke = 1.1, color = "red", fill = NA, size = 4
    ) +
    # Label selected markers
    geom_text_repel(
      data = df_lab,
      aes(x = logfc, y = logpv, label = gene),
      inherit.aes = FALSE,
      color = "red",
      size = 3.2,          # smaller text for tight panels
      box.padding = 0.5,  # less whitespace around labels
      point.padding = 0.5,# less whitespace around points
      force = 1.2,         # gentler repulsion (less jumping)
      max.overlaps = Inf,
      min.segment.length = 0,
      segment.size = 0.25  # thinner connector line
    ) +
    # Color scheme: Downregulated = blue, Upregulated = orange
    scale_color_manual(values = c(
      "Not Significant" = "grey70",
      "Downregulated"   = "#4C78A8",
      "Upregulated"     = "red"
    )) +
    theme_bw() +
    ggtitle(title) +
    xlab("log2 fold-change") +
    ylab("-log10(p-value)")
}

# P3: Cluster 1 vs Others (label CD8 only)
p3 <- volcano_custom(df_de_1, "Differential Expression: Cluster 1 vs Others", label_genes_p3)
# P4: Cluster 5 vs Others (label CD21, CD3e, CD45RO)
p4 <- volcano_custom(df_de_5, "Differential Expression: Cluster 5 vs Others", label_genes_p4)


# ---------------------------------------------------------
# 7) Marker expression maps (tSNE + physical space)
#    - CD8: tSNE + physical
#    - CD21: tSNE + physical
# ---------------------------------------------------------

plot_marker <- function(gene, low_col = "grey90", high_col = "#4C78A8") {
  tsne_df$gene_expression <- protein_data_norm[, gene]
  pos_df$gene_expression  <- protein_data_norm[, gene]
  
  p_tsne <- ggplot(tsne_df, aes(tSNE1, tSNE2, color = gene_expression)) +
    geom_point(size = 0.5) +
    scale_color_gradient(name = paste0(gene, " (log norm)"),
                         low = low_col, high = high_col) +
    ggtitle(paste(gene, "expression in tSNE space")) +
    theme_bw() +
    theme(legend.position = "right")
  
  p_space <- ggplot(pos_df, aes(x, y, color = gene_expression)) +
    geom_point(size = 0.5) +
    scale_color_gradient(name = paste0(gene, " (log norm)"),
                         low = low_col, high = high_col) +
    ggtitle(paste(gene, "expression in physical space")) +
    theme_bw() +
    theme(legend.position = "right")
  list(p_tsne = p_tsne, p_space = p_space)
}

# CD8 (cluster 1 marker): deep purple
pm_cd8  <- plot_marker("CD8",  low_col = "#F2F2F2", high_col = "#3F007D")

# CD21 (cluster 5 marker): deep orange
pm_cd21 <- plot_marker("CD21", low_col = "#F2F2F2", high_col = "#D55E00")


# -----------------------------
# Violin plots: All clusters
# -----------------------------
markers <- c("CD8", "CD21", "CD3e", "CD45RO")  # CD45RO (letter O)
clusters_all <- sort(unique(as.character(clusters_raw)))
df_all <- data.frame(
  cluster = factor(clusters_raw, levels = clusters_all),
  protein_data_norm[, markers, drop = FALSE]
)
df_all_long <- reshape2::melt(
  df_all,
  id.vars = "cluster",
  variable.name = "marker",
  value.name = "expr"
)

cluster_cols6 <- c(
  "1" = "#3F007D", 
  "2" = "grey80",
  "3" = "grey80",
  "4" = "grey80",
  "5" = "#D55E00",   
  "6" = "grey80"
)

p_violin_all <- ggplot(df_all_long, aes(x = cluster, y = expr, fill = cluster)) +
  geom_violin(trim = TRUE, scale = "width", color = NA, alpha = 0.9) +
  geom_boxplot(width = 0.18, outlier.size = 0.3, alpha = 0.6) +
  facet_wrap(~ marker, nrow = 1, scales = "free_y") +
  scale_fill_manual(values = cluster_cols6) +
  theme_bw() +
  theme(
    strip.background = element_rect(fill = "grey95", color = NA)
  ) +
  xlab("Cluster") +
  ylab("log-normalized expression") +
  ggtitle("Marker expression across 6 clusters")

p1
p2
p3
p4
pm_cd8$p_tsne
pm_cd8$p_space
pm_cd21$p_tsne
pm_cd21$p_space
p_violin_all

#AI prompts used
#“How can I create violin plots in ggplot2 to compare the expression of multiple markers across six k-means clusters, while highlighting two clusters of interest with distinct colors and displaying the remaining clusters in gray?”

19 Feb 2026

« B and T compartments in white pulp HW5: Multi-Panel Data Visualization of the White Pulp Tissue Structure in the CODEX Data »