Deciphering the Identity of a Biological Structure Using Proteomics Data


Sai C
I am a junior majoring in neuroscience, and I am from Minnesoota.

Deciphering the Identity of a Biological Structure Using Proteomics Data

In Figure 1, I chose to visualize quantitative positional data regarding the spatial organization of cells within the proteomics data set. I used the geometric primitive of points to represent single cells within our dataset. I used the visual channel of position along the x and y axes to encode the respective positions of cells across this novel structure, whose identity we are trying to decipher. Clearly, from this data visualization, we make more salient the point that there appears to be a space within our structure, shown in white, devoid of cells, while the peripheries of the structure are dense with cells. This clue will help us identify the biological entity being studied later on.

In Figure 2, I performed TSNE analysis (non-linear dimensional reduction), wherein which cells were organized in lower dimensional space based on their proteomic similarities (after normalization and log10 transformation of all protein expression values). I used the geometric primitive of points to represent single cells within our dataset. I also utilized the visual channel of position along the x and y axes to cluster cells based on proteomic similarity. Ultimately, the purpose of this data visualization is to show that cells form distinct groupings in the embedded space. This point is made poignantly given that cells, as indicated by points, can be grouped according to the Gestalt principle of grouping by proximity.

In Figure 3, I performed K-means clustering on the embedded space derived using TSNE. The geometric primitive of points was used to represent cells in the reduced dimensional space. Again, the visual channel of position along the x and y axes represents the grouping of cells based on similarity in proteomic profiles. The channel of color, specifically hue, was used to encode the cell clusters generated using K-means clustering. The purpose of this figure is to demonstrate that there are specific cell-clusters in our data set, and this point is strengthened by the Gestalt principle of grouping by similarity in color.

After generating these cell clusters, I randomly chose cluster 1 and compared it to the rest of the data set, using a “for loop” in combination with the Wilcoxon test (two-sided) to search for differentially expressed proteins. I plotted the results of this differential expression analysis of proteins using a heatmap. I found that CD15 levels were enriched in both clusters 1 and 6, and given that CD15 is a marker for myeloid cells, I concluded that clusters 1 and 6 must correspond to myeloid cells (e.g, granulocytes, monocytes, macrophages, and dendritic cells). It also appears clusters 2 and 5 have enriched levels of the CD8 protein, a marker for memory T cells, thereby leading me to conclude clusters 2 and 5 represent distinct populations of memory T-cells. Cluster 3 is enriched for vimentin, a marker for mesenchymal stem cells (undifferentiated), so perhaps cluster 3 corresponds to stem cells. And finally, cluster 4 has significantly higher levels of vimentin and CD15, so cluster 4 likely corresponds to mesenchymal stem cells differentiating into myeloid cells.

1
Overall, given that all our cell clusters correspond to immune cells or stem cells with the potential to differentiate into immune cells, the structure from which this data is derived is likely an immune system organ. Furthermore, from figure 1, it appears as if a blood vessel is surrounded by these immune cells, leading me to conclude that this structure likely corresponds to a region within the white pulp of the spleen. The white pulp of the spleen is innervated by blood vessels, but it is also rich with immune cells, thereby corroborating our findings.
data <- read.csv("/Users/ares2081/Desktop/codex_spleen_subset.csv")
head(data)
pos <- data[, c('x', 'y')]
rownames(pos) <- data[,1]
area <- data[,4]
pexp <- data[, 5:ncol(data)]
#CPM normalize to expression values of all proteins in the dataset
totprot <- rowSums(pexp)
normpexp <- pexp/totprot


#Plot raw cell data
library(ggplot2)
df1 <- data.frame(x = pos[,1], y = pos[,2])
p1 <- ggplot(data = df1,
             mapping = aes(x = x, y = y)) +
  geom_point(mapping = NULL, size=0.75) + 
  ggtitle("Figure 1: Spatial Distribution of Cells") + labs(x = "Medial-Lateral Position of Cell", y = "Inferior-Superior Positio of Cell") + 
  theme_bw() + theme(panel.grid.major = element_blank(), 
  panel.grid.minor = element_blank(), axis.ticks.x=element_blank(), 
  axis.text.x=element_blank(), axis.ticks.y=element_blank(),axis.text.y=element_blank())
p1


#Let's apply a non-linear dimensional reduction to our data (TSNE).
mat <- log10(normpexp+1)
library(Rtsne)
library(scattermore)
set.seed(0)
emb <- Rtsne(mat, dims=2, perplexity = 30)$Y
rownames(emb) <- rownames(mat)
#Plot MERFISH data in reduced 2-dimensional space
df2 <- data.frame(x = emb[,1], y = emb[,2])
p2 <- ggplot(data = df2, 
             mapping = aes(x = x, y = y)) + 
  geom_scattermore(mapping = NULL, pointsize=0.75) + 
  ggtitle("Figure 2: Visualizing Spatial Localization of Cells in Lower Dimensional Space Created Using TSNE") + 
  labs(x = "Emb1", y = "Emb2") + theme_bw() + theme(panel.grid.major = element_blank(), 
                                                    panel.grid.minor = element_blank(), axis.ticks.x=element_blank(), 
                                                    axis.text.x=element_blank(), axis.ticks.y=element_blank(),axis.text.y=element_blank())
p2
#There do not appear to be specific cell clusters in the embedding space, leading me to conclude we are likely dealing with cells that are similar in terms of proteomic expression.

#K-means clustering in embedding space
set.seed(0)
com <- kmeans(emb, centers=6)
#Number of centers selected on the basis of number of groups seen in lower dimensional spatial visualization. Since we didn't see many true clusters in the embedding space, I utilized a conservative estimate of 6 centers.
df3 <- data.frame(x = emb[,1],
                  y = emb[,2],
                  col = as.factor(com$cluster))
p3 <- ggplot(data = df3,
             mapping = aes(x = x, y = y)) +
  geom_scattermore(mapping = aes(col = col), pointsize=0.75) + 
  ggtitle("Figure 3: K-Means Clustering in a Lower Dimensional Space Created Using TSNE") + 
  labs(x = "Emb1", y = "Emb2", color = "Cell Cluster") + theme_classic() + theme_bw() + theme(panel.grid.major = element_blank(), 
                                                                                              panel.grid.minor = element_blank(), axis.ticks.x=element_blank(), 
                                                                                              axis.text.x=element_blank(), axis.ticks.y=element_blank(),axis.text.y=element_blank())
p3

#Identify the cell-types corresponding to clusters 1-6 using differential expression "for loop"
#Let g be a random protein from the data set
#Cluster 1
g <- 'FoxP3'
i <- com$cluster == 1
p_values <- sapply(colnames(mat), function(g) {
  x = wilcox.test(mat[i, g], mat[!i, g], 
                  alternative='two.sided')
  return(x$p.value)
})
#Since we are running the Wilcoxon test iteratively, we will need to apply a Bonferroni correction to account for false significant results.
adusted <- p.adjust(p_values)
#Compute fold changes for all proteins, comparing cluster 8 to the rest of the data set
fold_changes <- sapply(colnames(mat), function(g) {
  x = mean(mat[i, g])/mean(mat[!i, g])
  return(x)
})


#Generate heatmap of differentially expressed proteins
n <- names(which(adusted < 0.05))
n <- names(sort(fold_changes[n], decreasing=TRUE))
m <- mat[names(sort(com$cluster)),n]
col = rainbow(6)[as.factor(com$cluster)]
names(col) <- names(com$cluster)
heatmap(as.matrix(m), Rowv=NA, Colv=NA, scale='none',
        RowSideColors = col[rownames(m)])
#From the heatmap, it appears cluster 1 is enriched for CD15, a marker for myeloid cells (e.g., granulocytes, monocytes, macrophages, and dendritic cells).
#It appears cluster 2 has upregulated levels of CD8, a marker for memory T cells.
#Cluster 3 is enriched for vimentin, a marker for mesenchymal stem cells (undifferentiated)
#cluster 4 is enriched for both vimentin and CD15, perhaps a mesenchymal stem cell differentiating into myeloid cells.
#Cluster 5 is erniched for CD107a and CD8, which are both marker for T cells. 
#Like cluster 1, cluster 6 is also enriched for CD15, suggesting it also contains myeloid cells.
#Overall, the cells appear to be simply a vast collection of immune cells and stem cells. 
#Therefore, I conclude this data is derived from the white pulp of a human spleen, which is known to contain a bevy of immune cells.


#Plot cell clusters on original spatial map of cells
df4 <- data.frame(x = pos[,1],
                  y = pos[,2],
                  col = as.factor(com$cluster))
p4 <- ggplot(data = df4, mapping = aes(x = x, y = y)) + geom_scattermore(mapping = aes(col = col), pointsize=1.25) + 
  ggtitle("Figure 5: Cell Clusters Generated Using K-Means Clustering Plotted in a Spatially Resolved Manner on Mouse Spleen Map") + 
  labs(x = "Medial-Lateral Location of Cell", y = "Inferior-Superior Location of Cell", color = "Cell Cluster") + theme_classic() + theme_bw() + theme(panel.grid.major = element_blank(), 
                                                                                                                                                       panel.grid.minor = element_blank(), axis.ticks.x=element_blank(), 
                                                                                                                                                       axis.text.x=element_blank(), axis.ticks.y=element_blank(),axis.text.y=element_blank())
p4

#Multi-panel generation, since heatmap was made through native R, I will add this plot to the grid outside of R
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol=2)