Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve
Introduction
We developed a computational method for analyzing multi-cellular pixel-resolution spatial transcriptomics (ST) data called STdeconvolve
that recovers the proportion of cell types comprising each multi-cellular spatially resolved pixel along with each cell types’ putative transcriptional profile without reliance on external single-cell transcriptomics references. More details regarding how the method works can be found in the published paper as well as on https://jef.works/STdeconvolve/.
Some have noticed that when analyzing ST data with STdeconvolve
, no normalization is done. Rather, an unnormalized gene expression counts matrix is provided. This is different from dimensionality reduction and clustering analysis, where gene expression per spot must be normalized to control for variation in sequencing depth or other factors that could lead to certain spots having more genes detected than others.
So why do we normalize when performing dimensionality reduction and clustering but not when performing deconvolution with STdeconvolve
? Here, I code in R to take a hands-on, simulation-based approach to prove to myself that normalization impacts the results of dimensionality reduction and clustering but not deconvolution with STdeconvolve
. I use a simulated spatial transcriptomics dataset with very obvious variation in terms of total genes detected per spatially resolved spot. I then perform dimensionality reduction and clustering analysis on both library-size normalized and unnormalized gene expression counts to find very different results. Likewise, I perform deconvolution analysis with STdeconvolve
on both the library-size normalized and unnormalized gene expression counts to find that the inferred cell-type proportions don’t change, suggesting that normalization is not necessary.
Follow along in the video or try out the code below for yourself!
Video
Code
#### simulate ST data
set.seed(0)
G <- 3
N <- 100
M <- 300
initmean <- 10
initvar <- 10
mat <- matrix(rnorm(N*M*G, initmean, initvar), M, N*G)
rownames(mat) <- paste0('gene', 1:M)
colnames(mat) <- paste0('cell', 1:(N*G))
ct <- factor(sapply(1:G, function(x) {
rep(paste0('ct', x), N)
}))
names(ct) <- colnames(mat)
## Visualize heatmap where each row is a gene, each column is a cell
## Column side color notes the cell-type
par(mfrow=c(1,1))
heatmap(mat,
Rowv=NA, Colv=NA,
col=colorRampPalette(c('blue', 'white', 'red'))(100),
scale="none",
ColSideColors=rainbow(G)[ct],
labCol=FALSE, labRow=FALSE)
set.seed(0)
upreg <- 100
upregvar <- 10
ng <- 100
diff <- lapply(1:G, function(x) {
diff <- rownames(mat)[(((x-1)*ng)+1):(((x-1)*ng)+ng)]
mat[diff, ct==paste0('ct', x)] <<-
mat[diff, ct==paste0('ct', x)] +
rnorm(ng, upreg, upregvar)
return(diff)
})
names(diff) <- paste0('ct', 1:G)
par(mfrow=c(1,1))
heatmap(mat,
Rowv=NA, Colv=NA,
col=colorRampPalette(c('blue', 'white', 'red'))(100),
scale="none",
ColSideColors=rainbow(G)[ct],
labCol=FALSE, labRow=FALSE)
range(mat)
## positive expression only
mat[mat < 0] <- 0
## make counts
mat <- round(mat)
spotpos <- cbind(unlist(lapply(1:30, function(i) rep(i,30))), 1:30)
rownames(spotpos) <- paste0('spot-', 1:nrow(spotpos))
colnames(spotpos) <- c('x','y')
dim(spotpos)
par(mfrow=c(1,1))
plot(spotpos, cex=1)
## mix 3 cell-types
ct1 <- names(ct)[ct == 'ct1']
ct2 <- names(ct)[ct == 'ct2']
ct3 <- names(ct)[ct == 'ct3']
nmix <- nrow(spotpos)/2
pct1 <- c(unlist(lapply(rev(1:nmix), function(i) i/nmix)), rep(0,nmix))
pct2 <- c(rep(0,nmix), unlist(lapply(1:nmix, function(i) i/nmix)))
pct3 <- 1-(pct1+pct2)
## Show proportion of each cell-type across spots
par(mfrow=c(3,1), mar=rep(2,4))
barplot(pct1, ylim=c(0,1), main='proportion ct1')
barplot(pct2, ylim=c(0,1), main='proportion ct2')
barplot(pct3, ylim=c(0,1), main='proportion ct3')
pct <- cbind(pct1, pct2, pct3)
rownames(pct) <- rownames(spotpos)
## Visualize as pie charts
STdeconvolve::vizAllTopics(pct, spotpos) +
ggplot2::guides(colour = "none")
## assume 10 cells per spot
ncells <- 10
## make gene expression matrix
spotmat <- do.call(cbind, lapply(1:nrow(spotpos), function(i) {
spotcells <- c(
sample(ct1, pct1[i]*ncells),
sample(ct2, pct2[i]*ncells),
sample(ct3, pct3[i]*ncells)
)
rowSums(mat[,spotcells])
}))
colnames(spotmat) <- rownames(spotpos)
## simulated ST data
head(pct)
head(spotmat)
dim(spotmat)
head(spotpos)
dim(spotpos)
#### tweaks to make normalization more prominent
tweak <- rownames(spotpos)[spotpos[,'y'] %in% seq(1,30,by=2)]
spotmattweak <- spotmat
spotmattweak[, tweak] <- spotmattweak[, tweak]*5
colSums(spotmattweak)
colSums(spotmat)
par(mfrow=c(1,1))
MERINGUE::plotEmbedding(spotpos, col=colSums(spotmattweak))
#### exploring impact of normalization
## what happens if we don't normalize?
## dimensionality reduction with PCA
## and clustering with kmeans
pcs <- prcomp(t(spotmattweak))
plot(pcs$x[,1:2], pch=16)
com <- kmeans(pcs$x[,1:3], centers = 3)
MERINGUE::plotEmbedding(pcs$x[,1:2], groups=com$cluster)
MERINGUE::plotEmbedding(spotpos, groups=com$cluster)
MERINGUE::plotEmbedding(pcs$x[,1:2], col=colSums(spotmattweak))
MERINGUE::plotEmbedding(spotpos, col=colSums(spotmattweak))
## what happens if we do normalize?
spotmattweaknorm <- t(t(spotmattweak)/colSums(spotmattweak))
spotmattweaknorm <- spotmattweaknorm*1e6
pcs2 <- prcomp(t(spotmattweaknorm))
plot(pcs2$x[,1:2], pch=16)
com2 <- kmeans(pcs2$x[,1:3], centers = 3)
MERINGUE::plotEmbedding(pcs2$x[,1:2], groups=com2$cluster)
MERINGUE::plotEmbedding(spotpos, groups=com2$cluster)
table(com$cluster, com2$cluster)
## deconvolution analysis
## no normalization
ldas <- fitLDA(t(as.matrix(spotmattweak)), Ks = 3)
optLDA <- optimalModel(models = ldas, opt = "min")
results <- getBetaTheta(optLDA, perc.filt = 0.05, betaScale = 1000)
deconProp <- results$theta
deconGexp <- results$beta
## visualize deconvolved cell-type proportions
vizAllTopics(deconProp, spotpos)
cor(pct, deconProp)
## with normalization
## requires rounding hack to get counts
## (not recommended because as you'll see we get the same answer anyway)
ldas2 <- fitLDA(t(as.matrix(round(spotmattweaknorm))), Ks = 3)
optLDA2 <- optimalModel(models = ldas2, opt = "min")
results2 <- getBetaTheta(optLDA2, perc.filt = 0.05, betaScale = 1000)
deconProp2 <- results2$theta
deconGexp2 <- results2$beta
## visualize deconvolved cell-type proportions
vizAllTopics(deconProp2, spotpos)
cor(pct, deconProp2)
cor(deconProp, deconProp2)
Other related posts
- Older
- Newer
Recent Posts
- Using AI to find heterogeneous scientific speakers on 04 November 2024
- The many ways to calculate Moran's I for identifying spatially variable genes in spatial transcriptomics data on 29 August 2024
- Characterizing spatial heterogeneity using spatial bootstrapping with SEraster on 23 July 2024
- I use R to (try to) figure out which hospital I should go to for shoppable medical services by comparing costs through analyzing Hospital Price Transparency data on 22 April 2024
- Cross modality image alignment at single cell resolution with STalign on 11 April 2024
Related Posts
- Using AI to find heterogeneous scientific speakers
- The many ways to calculate Moran's I for identifying spatially variable genes in spatial transcriptomics data
- Characterizing spatial heterogeneity using spatial bootstrapping with SEraster
- I use R to (try to) figure out which hospital I should go to for shoppable medical services by comparing costs through analyzing Hospital Price Transparency data