HW5

I am a senior BME undergraduate student from LA. My research interests include stem cells and regenerative medicine. Very excited to be here!

1. Figure Description

I created a multipanel figure to show the distribution of B cells and T cells in the spleen. Throughout, I used the Gestalt principle of similarity to identify my cell types of interest by hue, with B cells in red and T cells in cyan. In the top row, I plotted the spatial distribution of the cells with hue encoding cluster. To match the clusters of interest, I colored expression of CD3e, a very important T cell marker, in cyan, and expression of CD21, a B cell marker, in red. Similarly, in the middle row, I plotted the same cluster hue encoding and the same CD3e and CD21 protein expression, this time in reduced-dimensional tSNE space. Throughout these plots, I also colored the cells that I did not classify in grey to make my cell clusters of interest more salient. Finally, in the bottom row, I plotted the differential expression of all proteins as a dot plot: the size of each dot encodes the -log10 p-value from a one-sided Wilcoxon test that the protein is upregulated in that cluster, while the saturation of the dot (grey to blue) encodes the average log-transformed expression of the protein in that cluster.
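Making the clusters of interest salient depends partly on draw order, since points plotted later sit on top of earlier ones. Here is a minimal base R sketch of that idea on made-up data; the `toy` data frame and its labels are hypothetical stand-ins, not the homework data.

```r
# Toy data: the labels are hypothetical stand-ins for the annotated clusters
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(6, 5, 4, 3, 2, 1),
  label = c("T cells", "Other", "B cells", "Other", "Other", "T cells")
)

# FALSE sorts before TRUE, so "Other" rows come first and are drawn underneath
is_interest <- toy$label %in% c("B cells", "T cells")
toy_ordered <- toy[order(is_interest), ]

# Similarity by hue: one colour per cell type of interest, grey for the rest
pal <- c("B cells" = "red", "T cells" = "cyan", "Other" = "grey")
plot(toy_ordered$x, toy_ordered$y,
     col = pal[as.character(toy_ordered$label)], pch = 16,
     xlab = "x", ylab = "y")
```

Sorting on a logical vector keeps the ordering rule readable: anything that is not a cell type of interest goes to the bottom layer.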

2. Cell type identification

I determined that my cluster 3 represents T cells because of its differential upregulation of key T cell markers: CD3e, CD45RO, CD4, CD45, and CD5. CD3e is a well-known marker of all T cells: it is a subunit of the TCR-CD3 complex, is expressed on virtually all mature T cells, and is essential for T cell receptor signaling and activation of both helper and cytotoxic T cells (https://www.nature.com/articles/s41467-019-12464-3). CD4 is a common identifier of helper T cells and stabilizes MHC class II interactions during antigen recognition (https://www.nature.com/articles/s41467-019-12464-3). CD45 is a transmembrane phosphatase expressed broadly on hematopoietic cells, but it plays a particularly important role in T cell activation by regulating TCR signal transduction (https://pubmed.ncbi.nlm.nih.gov/9429890/). Finally, CD5 is a glycoprotein expressed on all T cells that modulates TCR signaling thresholds and is important for T cell development and activation (https://pmc.ncbi.nlm.nih.gov/articles/PMC5555168/).

I was also able to identify my cluster 5 as B cells because of its high differential expression of CD21, CD20, and CD35, with a noticeable lack of T cell markers like CD3e and CD4. CD21 is expressed on mature and transitional B cells, where it forms part of the B cell coreceptor complex with CD19 and enhances B cell activation upon binding complement-opsonized antigens (https://ashpublications.org/blood/article/115/3/519/27130/Differential-expression-of-CD21-identifies). CD20 is a B cell-specific transmembrane protein expressed from the pre-B cell stage through mature and memory B cells, and it plays a role in B cell receptor signaling (https://www.pnas.org/doi/10.1073/pnas.2021342118). Additionally, CD35 (complement receptor 1) works with CD21 in the humoral immune response, helping retain antigens for B cell stimulation (https://pmc.ncbi.nlm.nih.gov/articles/PMC9939015/). Finally, although many of the proteins were shared between my B and T cell clusters, the lack of CD4, CD45RO, and CD3e supported that these were B lymphocytes rather than T cells.
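The "differential upregulation" evidence used in this section comes from a one-vs-rest, one-sided Wilcoxon rank-sum test. Here is a minimal sketch of that test on simulated values; the group sizes, means, and variable names are assumptions for illustration, not the CODEX data.

```r
set.seed(1)  # toy simulation only; not the CODEX data

# Simulated log10 expression of one marker in two groups of cells
in_cluster  <- rnorm(200, mean = 1.5, sd = 0.3)  # cells in the cluster of interest
out_cluster <- rnorm(800, mean = 1.0, sd = 0.3)  # all remaining cells

# One-sided test: is expression greater inside the cluster than outside?
p <- wilcox.test(in_cluster, out_cluster, alternative = "greater")$p.value

# The two quantities a dot in the dot plot encodes
sig <- -log10(p)         # dot size
avg <- mean(in_cluster)  # dot colour

# The reverse comparison should give no evidence of upregulation
p_rev <- wilcox.test(out_cluster, in_cluster, alternative = "greater")$p.value
```

A one-sided test matters here: a marker that is strongly depleted in a cluster gets a p-value near 1 rather than a small one, so only upregulated proteins produce large dots.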

3. Tissue identification

According to my data, the dominant cell types in the kidney-shaped structure in the center of the physical space are lymphocytes. Therefore, I believe that this structure is most likely white pulp, a lymphocyte-rich region of the spleen that acts as a hub for immune activation and response (https://pubmed.ncbi.nlm.nih.gov/10671220/#:~:text=The%20article%20describes%20a%20technique%20for%20isolating,pulp%20greatly%20exceeding%20that%20into%20lymph%20nodes). In standard histology of the spleen, white pulp is surrounded by red pulp, and I suspect the same here based on the spatial distribution of the B and T cells (https://vmicro.iusm.iu.edu/hs_vm/docs/lab7_6.htm). Red pulp also has a notable lack of lymphocytes, and since the lymphocytes are concentrated in the center of this data, I assume that the central structure, rather than the surrounding tissue, is the region of interest. In an artery or vein, we would expect to see a hollow area that blood flows through and far fewer lymphocytes, which does not match our spatial structure. Additionally, the density of lymphocytes in the structure does not resemble peripheral blood, which is usually only about 1% leukocytes (https://www.ncbi.nlm.nih.gov/books/NBK2263/). Finally, the capsule and trabeculae of the spleen consist primarily of dense connective tissue and smooth muscle, but markers like Vimentin and SM actin were only sparsely expressed throughout the kidney-shaped structure of interest.
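One way to back up the claim that lymphocytes dominate the central structure is to compare the lymphocyte fraction inside and outside a spatial window. Here is a minimal sketch on simulated coordinates; the unit-square layout, the circular patch, and the window bounds are all assumptions for illustration.

```r
set.seed(2)  # toy simulation only

# Hypothetical cells: a dense lymphocyte patch in the centre of a unit square
n <- 1000
cells <- data.frame(x = runif(n), y = runif(n))
in_patch <- (cells$x - 0.5)^2 + (cells$y - 0.5)^2 < 0.2^2
cells$type <- ifelse(in_patch,
                     sample(c("B cells", "T cells"), n, replace = TRUE),
                     "Other")

# Fraction of lymphocytes inside versus outside a central window
is_lymph  <- cells$type %in% c("B cells", "T cells")
in_window <- cells$x > 0.4 & cells$x < 0.6 & cells$y > 0.4 & cells$y < 0.6
frac_inside  <- mean(is_lymph[in_window])
frac_outside <- mean(is_lymph[!in_window])
```

On real data the window would be drawn around the structure by eye or from a cluster boundary, and a large gap between the two fractions would support the white pulp interpretation.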

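library(ggplot2)
library(dplyr)
library(patchwork)  # provides the + and / plot layout operators and plot_annotation() used for the final figure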
data <- read.csv('/Users/willli/Documents/BME 25-26/Genomic Data Visualization/genomic-data-visualization-2026/data/codex_spleen2.csv.gz')
data[1:8,1:8]

dim(data)
pos <- data[, 2:3]
head(pos)
pexp <- data[, 5:ncol(data)]
head(pexp)
area <- data[, 4]
names(area) <- rownames(pos) <- rownames(pexp) <- data[,1]
  
## QC
hist(area, breaks=50)    #looks like a big jump from<100 and long tail > 700
df = data.frame(area)
area <- area[area >= quantile(area, 0.025) & area <= quantile(area, 0.975)] #95% interval to remove outliers
hist(area, breaks = 50)
pexp <- pexp[names(area), ]
pos <- pos[names(area), ]
mat <- log10(pexp+1)

df = data.frame(area, totpexp = rowSums(pexp))
ggplot(df, aes(x=log10(area), y=log10(totpexp))) + geom_point()  #Little correlation with size of cell vs protein so no normalization
ggplot(df, aes(x = 0, y = log10(totpexp))) + geom_violin()        

# Pexp vs xy
df = data.frame(area, totpexp = rowSums(pexp), pos)
ggplot(df, aes(x=x, y=y, col=log10(totpexp))) + geom_point(size=0.5)

## Highly variable proteins vs xy
dim(pexp)
hvgs <- names(sort(apply(pexp, 2, var), decreasing=TRUE))[1:10]
df = data.frame(pos, pexp)
ggplot(df, aes(y=1, x=log10(CD15))) + geom_violin()
for(i in hvgs){
  print(ggplot(df, aes(x=x, y=y, col=log10(.data[[i]]))) + geom_point(size=0.5) + labs(title = i))
}
interesting_genes <- c('CD3e', 'CD20', 'CD21')

# PCA
pcs <- prcomp(mat, center=TRUE, scale.=FALSE)  # centered but unscaled PCA on log10-transformed expression
toppcs <- pcs$x[, 1:10]
plot(pcs$sdev)
tsne <- Rtsne::Rtsne(toppcs, dims = 2, perplexity = 30)
emb <- tsne$Y
rownames(emb) <- rownames(mat)
colnames(emb) <- c('tsne1', 'tsne2')

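# k-means (like tSNE above) is stochastic: without set.seed(), cluster numbers can change between runs,
# so the labels assigned below (3 = T cells, 5 = B cells) should be re-checked after every rerun
clusters <- as.factor(kmeans(toppcs, centers=7)$cluster)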

df_all <- data.frame(pos, emb, mat, toppcs, clusters)
rownames(df_all) <- rownames(mat)
ggplot(df_all, aes(x=x, y=y, col=clusters)) + geom_point()
ggplot(df_all, aes(x=tsne1, y=tsne2, col=clusters)) + geom_point()
ggplot(df_all, aes(x=tsne1, y=tsne2, col=CD3e)) + geom_point()
ggplot(df_all, aes(x=x, y=y, col=CD3e)) + geom_point()
ggplot(df_all, aes(x=tsne1, y=tsne2, col=CD21)) + geom_point()
ggplot(df_all, aes(x=x, y=y, col=CD21)) + geom_point()

# One-vs-rest, one-sided Wilcoxon rank-sum test: is each protein upregulated in each cluster?
proteins <- colnames(mat)
avg_expr <- matrix(NA, nrow=length(unique(clusters)), ncol=length(proteins),
                   dimnames=list(sort(unique(clusters)), proteins))
sig_out  <- matrix(NA, nrow=length(unique(clusters)), ncol=length(proteins),
                   dimnames=list(sort(unique(clusters)), proteins))
for (cl in sort(unique(clusters))) {
  idx1 <- which(clusters == cl)
  idx2 <- which(clusters != cl)
  
  for (prot in proteins) {
    x1 <- mat[idx1, prot]
    x2 <- mat[idx2, prot]
    
    avg_expr[cl, prot] <- mean(x1)
    sig_out[cl, prot] <- -log10(wilcox.test(x1, x2, alternative='greater')$p.value)
  }
}

df_dot <- data.frame(
  cluster = rep(rownames(avg_expr), times=length(proteins)),
  protein = rep(proteins, each=length(unique(clusters))),
  avg_expression = as.vector(avg_expr),
  sig = as.vector(sig_out)
)

dot_plot <- function(df){
  ggplot(df, aes(x=protein, y=as.factor(cluster), 
                 color=avg_expression, 
                 size=sig)) +
    geom_point(aes(size = ifelse(sig >= 3, sig, NA) )) +
    scale_color_gradient(low='grey', high='blue') +
    scale_size_continuous(breaks=c(10, 100, 200)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=45, hjust=1)) +
    labs(x='Protein', y='Cluster', 
         color='log10 Avg Expression', 
         size='-log10(p-value)',
         title = "Dotplot of protein expression by cell type cluster")
}


df_dot$sig[is.infinite(df_dot$sig)] <- max(df_dot$sig[is.finite(df_dot$sig)])
dot_plot(df_dot)


df_dot$cluster <- recode(df_dot$cluster, '5' = 'B cells', '3' = 'T cells')
df_all$clusters <- recode(df_all$clusters, '5' = 'B cells', '3' = 'T cells')

protein_order <- hclust(dist(t(avg_expr)))$order
ordered_proteins <- colnames(avg_expr)[protein_order]
df_dot$protein <- factor(df_dot$protein, levels=ordered_proteins)

dot_plot_fig <- dot_plot(df_dot)

cells_xy <- ggplot(df_all %>% arrange(clusters %in% c("B cells", "T cells")), aes(x=x, y=y, 
                                  col = case_when(
                                    as.character(clusters) == "T cells" ~ "T cells",
                                    as.character(clusters) == "B cells" ~ "B cells",
                                    TRUE ~ "Other"
                                  ))) +
  geom_point(alpha=0.5, size=0.5) +
  scale_color_manual(values = c("B cells" = "red", 
                                "T cells" = "cyan", 
                                "Other" = "grey")) +
  labs(title='Cell types in physical space', color='Cell type')

cells_tsne <- ggplot(df_all %>% arrange(clusters %in% c("B cells", "T cells")), aes(x=tsne1, y=tsne2, 
                                  col = case_when(
                                    as.character(clusters) == "B cells" ~ "B cells",
                                    as.character(clusters) == "T cells" ~ "T cells",
                                    TRUE ~ "Other"
                                  ))) +
  geom_point(alpha=0.5, size=0.5) +
  scale_color_manual(values = c("B cells" = "red", 
                                "T cells" = "cyan", 
                                "Other" = "grey")) +
  labs(title='Cell types in tSNE space', color='Cell type')


B_cells <- subset(df_dot, cluster %in% 'B cells')
B_cells[B_cells$sig >20 , ]


T_cells <- subset(df_dot, cluster %in% 'T cells')
T_cells[T_cells$sig >20 , ]

CD3e_xy <- ggplot(df_all %>% arrange(desc(clusters != 'T cells')), aes(x=x, y=y, col=CD3e))+geom_point(size=0.5, alpha=0.5) +
  scale_color_gradient2(low='grey', mid='grey', high='cyan', midpoint=median(df_all$CD3e)) +
  labs(title='CD3e in physical space', color='log10(CD3e expr)')
CD3e_tsne <- ggplot(df_all %>% arrange(desc(clusters != 'T cells')), aes(x=tsne1, y=tsne2, col=CD3e))+geom_point(size=0.5, alpha=0.5) +
  scale_color_gradient2(low='grey', mid='grey', high='cyan', midpoint=median(df_all$CD3e)) +
  labs(title='CD3e in tSNE space', color='log10(CD3e expr)')


CD21_xy <- ggplot(df_all %>% arrange(desc(clusters != "B cells")), aes(x=x, y=y, col=CD21))+geom_point(size=0.5, alpha=0.5) +
  scale_color_gradient2(low='grey', mid='grey', high='red', midpoint=mean(df_all$CD21)) +
  labs(title='CD21 in physical space', color='log10(CD21 expr)')

CD21_tsne <- ggplot(df_all %>% arrange(desc(clusters != "B cells")), aes(x=tsne1, y=tsne2, col=CD21))+geom_point(size=0.5, alpha=0.5) +
  scale_color_gradient2(low='grey', mid='grey', high='red', midpoint=mean(df_all$CD21)) +
  labs(title='CD21 in tSNE space', color='log10(CD21 expr)')

(cells_xy + CD3e_xy + CD21_xy)/(cells_tsne + CD3e_tsne + CD21_tsne)/(dot_plot_fig) +
  plot_annotation(title = "Spatial Proteomics of T cells and B cells in Spleen White Pulp")
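
Because k-means cluster numbers are arbitrary, hard-coded labels such as '5' = 'B cells' only hold for one particular run. Here is a minimal sketch of an alternative: assigning labels from marker expression instead of cluster number. The simulated `toy_mat` and the helper names `t_id` and `b_id` are hypothetical, and using the highest mean of a single marker is a simplifying assumption rather than the approach used above.

```r
set.seed(3)  # toy simulation only

# Hypothetical log10 expression for 300 cells and two markers
toy_mat <- data.frame(
  CD3e = c(rnorm(100, 2.0, 0.1), rnorm(100, 0.5, 0.1), rnorm(100, 0.5, 0.1)),
  CD21 = c(rnorm(100, 0.5, 0.1), rnorm(100, 2.0, 0.1), rnorm(100, 0.5, 0.1))
)
toy_clusters <- as.factor(kmeans(toy_mat, centers = 3, nstart = 10)$cluster)

# Mean expression of each marker per cluster (one row per cluster)
cluster_means <- aggregate(toy_mat, by = list(cluster = toy_clusters), FUN = mean)

# Pick the cluster with the highest mean of each marker instead of a fixed number
t_id <- as.character(cluster_means$cluster[which.max(cluster_means$CD3e)])
b_id <- as.character(cluster_means$cluster[which.max(cluster_means$CD21)])

labels <- as.character(toy_clusters)
labels[labels == t_id] <- "T cells"
labels[labels == b_id] <- "B cells"
labels[!labels %in% c("T cells", "B cells")] <- "Other"
```

The resulting labels stay attached to the CD3e-high and CD21-high clusters even when the numbering changes between runs.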

20 Feb 2026
