Cluster characterization in sequencing-based spatial transcriptomics


Kevin Meza Landeros
I am a fisrt-year PhD student in BME at Johns Hopkins university at Rachel Karchin's Lab. I work with omics data to unveil the immune tumor microenvironment. Apart from science I love exercising and watching series

Cluster characterization in sequencing-based spatial transcriptomics

Homework 3

Figure description

Here, I am analyzing an sequencing-based spatial transcriptomics dataset. To account for the fact that cells can have different sequencing depths, data was normalized: raw counts were divided by total cell counts, multiplied by the scale factor 10000 and the result was log transformed. Likewise, data was scaled to prevent genes with the highest mean from contributing the most to the variabliity. PCA was performed to reduce the dimensionality of the dataset by eliminating the effect of gene correlation. Then, k-means was used to find clusters of cells, using PCs as input. Lastly, t-SNE was used for visual purposes and calculated using PCs as input, were cells with similar transcriptome appear together.

To further understand the tissue, cluster 6 was chosen and characterized. Panel A shows cells and their coordinates in the 1st and 2nd dimensions of the reduced space; highlighting C6 with a distinct color hue. Panel B shows the spatial location of C6 cells which most likely correspond to an specific cell type.

To identify the disctinct features of C6, differentially expressed genes (DEGS) were calculated. A two-sided Wilcoxon-test for each gene was performed comparing cells from C6 vs the REST, using normalized data. This followed by Benjamini Hochberg FDR for multiple test correction. All DEGs having padj<0.05 and abs(log2FC)>0.5 are shown in Panel C. Gene “PLA2G2D” had the highest logFC of 2.72 amongst genes with mean gene expression >0.1 in C6. Panel D and Panel E show the normalized expression values of gene “PLA2G2D” for every cell in the t-SNE coordinates and their spatial location.

As a last step, SingleR (Dvir Aran, 2019) package was used to label cells based on their normalized gene expression data using the Human Primary Cell Atlas as a reference panel. Panel F depicts the predicted cell-type frequency across C6 cells. Results suggest that cluster corresponds to stem-cells (CD44, ZEB2) and within those spots there might be fibroblasts (CXCL13), T cells (IL7R, TRBC2), and some Macrophanges (PLA2G2D, TGFB1), Monocytes (FCGR3A) as well. As a verification step, the “upregulated genes” were looked up in “THE HUMAN PROTEIN ATLAS” (https://www.proteinatlas.org/). After looking at the breast cancer tissue I confirmed that PLA2G2D and TGFB1 are expressed in Macrophages, IL7R and TRBC2 in T-cells, and CXCL13 is expressed in fibroblasts.

Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# - Import libraries
library(ggplot2)
library(tidyverse)
library(cowplot)
library(patchwork)
library(Rtsne)
library(DESeq2)
library(SingleR); library(celldex); library(biomaRt)

# - Read data
file <- '/Users/kmlanderos/Documents/Johns_Hopkins/Spring_2025/Genomic_Data_Visualization/genomic-data-visualization-2025/data/eevee.csv.gz'
data <- read.csv(file, row.names=1)
gexp<-data[,4:ncol(data)]
gexp<-gexp[,colSums(gexp)!=0] # Remove genes with no expression across cells
set.seed(100)

# - Normalize and scale data first
norm_data<-log2(10000*gexp/rowSums(gexp) + 1)
scaled_data<-scale(norm_data)

# - PCA
pca <- prcomp(scaled_data)

<How to choose PCs?: Elbow PLOT>
<How to choose k-means?>

# - K means
k=7
res<-kmeans(pca$x[,1:15], centers=k)
clusters<- as.factor(res$cluster)
# - t-SNE
tsne <- Rtsne(pca$x[,1:15])

popo <- data.frame(data$barcode, clusters, X_coord=data$aligned_x, Y_coord=data$aligned_y, pca$x[,1:3], tSNE_1= tsne$Y[,1], tSNE_2= tsne$Y[,2])
ggplot(popo, aes(x=tSNE_1, y=tSNE_2, color=clusters)) + geom_point(size=0.1)

# - DEGs
cluster <- 6
condition1_cells <- names(clusters)[clusters == cluster]
condition2_cells <- names(clusters)[clusters != cluster]
# Test all genes
cluster_comparisons <- sapply(1:ncol(norm_data), function(i){
  gene<-colnames(norm_data)[i]
  out <- wilcox.test(norm_data[condition1_cells,i], norm_data[condition2_cells,i], alternative='two.sided'); res <- out$p.value; names(res) <- gene
  res
})
genes_df<-data.frame(clust_mean=colMeans(norm_data[condition1_cells,]), rest_mean=colMeans(norm_data[condition2_cells,]), log2FC=log2(colMeans(norm_data[condition1_cells,])/colMeans(norm_data[condition2_cells,])), p_value=cluster_comparisons, p_adj=p.adjust(cluster_comparisons, method = "BH")) %>% mutate(
    log_p_adj=-log10(p_adj),
    significance = case_when(
      log2FC > 0.50 & p_adj < 0.05 ~ "Upregulated",
      log2FC < -0.50 & p_adj < 0.05 ~ "Downregulated",
      TRUE ~ "Not Significant"
    )
  )
  
# Run SingleR
ref <- celldex::HumanPrimaryCellAtlasData() # Load reference (Human Primary Cell Atlas)
singleR_results <- SingleR(test = t(norm_data), ref = ref, labels = ref$label.main)

# - Plots
# Data Frame
df <- data.frame(data$barcode, clusters, X_coord=data$aligned_x, Y_coord=data$aligned_y, pca$x[,1:3], tSNE_1= tsne$Y[,1], tSNE_2= tsne$Y[,2], celltype_singleR=singleR_results$labels) %>% mutate(Group = case_when(
      clusters == 6 ~ "C6",
      clusters != 6 ~ "REST"
    ))

# Plot Panel A 
p1<-ggplot(df, aes(x=tSNE_1, y=tSNE_2, color=Group)) + geom_point(size=2.0) + scale_color_manual(values = c("C6" = "black", "REST" = "gray80")) + labs(x="tSNE 1", y="tSNE 2") + theme_classic()
# Plot Panel B
p2<-ggplot(df, aes(x=X_coord, y=Y_coord, color=Group)) + scale_color_manual(values = c("C6" = "black", "REST" = "gray80")) +
labs(x="X coordinate", y="Y coordinate") + geom_point(size=2.0) + theme_classic()
# Plot Panel C
p3<-ggplot(genes_df, aes(x=log2FC, y=log_p_adj, color=significance)) + geom_point(size=1) + scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Not Significant" = "gray80")) +
  labs(title = "DEGs (REST VS C6)",
       x = "Log2(Fold Change)",
       y = "-Log10(Adj P-Value)") + theme_classic()
# Plot Panel D
gene <- genes_df %>% filter(p_adj<0.05 & clust_mean>0.1) %>% arrange(desc(log2FC)) %>% head(1) %>% rownames()
p4<-ggplot(df, aes(x=tSNE_1, y=tSNE_2, color=norm_data[,gene])) + geom_point(size=2.0) +
scale_colour_gradient(low = "gray80", high = "#5800FF") + labs(x="tSNE 1", y="tSNE 2", color="PLA2G2D") + theme_classic()
# Plot Panel E
p5<-ggplot(df, aes(x=X_coord, y=Y_coord, color=norm_data[,gene])) + geom_point(size=2.0) + scale_colour_gradient(low = "gray80", high = "#5800FF") +
labs(x="X coordinate", y="Y coordinate", color="PLA2G2D ") + theme_classic()
# Plot Panel F
ct_df <- df %>% filter(clusters == 6) %>% pull(celltype_singleR) %>% table() %>% sort() %>% reshape2::melt(); colnames(ct_df) <- c('Celltype','cells')
p6<-ggplot(ct_df, aes(x=Celltype, y=cells)) + geom_bar(stat="identity") + labs(x="Cell type", y="# of cells", color="Cell type", title="SingleR C6 celltype prediction") + theme_classic()+ theme(axis.text.x = element_text(angle = 75, vjust = 0.5, hjust=0.5))
genes_df %>% filter(p_adj<0.05) %>% arrange(desc(log2FC)) %>% head()

# Plot all Figures
plot_grid(p1, p2, p3, p4, p5, p6, labels = "AUTO")