Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve
May 4, 2023
Introduction
We developed a computational method for analyzing multi-cellular pixel-resolution spatial transcriptomics (ST) data called STdeconvolve, which recovers the proportion of cell types comprising each multi-cellular spatially resolved pixel along with each cell type's putative transcriptional profile, without relying on an external single-cell transcriptomics reference. More details regarding how the method works can be found in the published paper as well as on https://jef.works/STdeconvolve/.
Some have noticed that when analyzing ST data with STdeconvolve, no normalization is done. Rather, an unnormalized gene expression counts matrix is provided. This differs from dimensionality reduction and clustering analysis, where gene expression per spot must be normalized to control for variation in sequencing depth or other factors that could lead to certain spots having more counts detected than others.
So why do we normalize when performing dimensionality reduction and clustering but not when performing deconvolution with STdeconvolve? Here, I code in R to take a hands-on, simulation-based approach to prove to myself that normalization impacts the results of dimensionality reduction and clustering but not deconvolution with STdeconvolve. I use a simulated spatial transcriptomics dataset with very obvious variation in sequencing depth (total counts detected) across spatially resolved spots. I then perform dimensionality reduction and clustering analysis on both library-size-normalized and unnormalized gene expression counts and find very different results. Likewise, I perform deconvolution analysis with STdeconvolve on both the library-size-normalized and unnormalized gene expression counts and find that the inferred cell-type proportions don't change, suggesting that normalization is not necessary.
Follow along in the video or try out the code below for yourself!
Video
Code
#### simulate ST data
set.seed(0)
G <- 3
N <- 100
M <- 300
initmean <- 10
initvar <- 10
mat <- matrix(rnorm(N*M*G, initmean, initvar), M, N*G)
rownames(mat) <- paste0('gene', 1:M)
colnames(mat) <- paste0('cell', 1:(N*G))
ct <- factor(sapply(1:G, function(x) {
  rep(paste0('ct', x), N)
}))
names(ct) <- colnames(mat)
## Visualize heatmap where each row is a gene, each column is a cell
## Column side color notes the cell-type
par(mfrow=c(1,1))
heatmap(mat,
        Rowv=NA, Colv=NA,
        col=colorRampPalette(c('blue', 'white', 'red'))(100),
        scale="none",
        ColSideColors=rainbow(G)[ct],
        labCol=FALSE, labRow=FALSE)
set.seed(0)
upreg <- 100
upregvar <- 10
ng <- 100
diff <- lapply(1:G, function(x) {
  diff <- rownames(mat)[(((x-1)*ng)+1):(((x-1)*ng)+ng)]
  mat[diff, ct==paste0('ct', x)] <<-
    mat[diff, ct==paste0('ct', x)] +
    rnorm(ng, upreg, upregvar)
  return(diff)
})
names(diff) <- paste0('ct', 1:G)
par(mfrow=c(1,1))
heatmap(mat,
        Rowv=NA, Colv=NA,
        col=colorRampPalette(c('blue', 'white', 'red'))(100),
        scale="none",
        ColSideColors=rainbow(G)[ct],
        labCol=FALSE, labRow=FALSE)
range(mat)
## positive expression only
mat[mat < 0] <- 0
## make counts
mat <- round(mat)
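## quick sanity check: at this point the simulated single cells should have roughly comparable library sizes
range(colSums(mat))
hist(colSums(mat), breaks=50, main='total counts per simulated cell', xlab='total counts')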
spotpos <- cbind(unlist(lapply(1:30, function(i) rep(i,30))), 1:30)
rownames(spotpos) <- paste0('spot-', 1:nrow(spotpos))
colnames(spotpos) <- c('x','y')
dim(spotpos)
par(mfrow=c(1,1))
plot(spotpos, cex=1)
## mix 3 cell-types
ct1 <- names(ct)[ct == 'ct1']
ct2 <- names(ct)[ct == 'ct2']
ct3 <- names(ct)[ct == 'ct3']
nmix <- nrow(spotpos)/2
pct1 <- c(unlist(lapply(rev(1:nmix), function(i) i/nmix)), rep(0,nmix))
pct2 <- c(rep(0,nmix), unlist(lapply(1:nmix, function(i) i/nmix)))
pct3 <- 1-(pct1+pct2)
## Show proportion of each cell-type across spots
par(mfrow=c(3,1), mar=rep(2,4))
barplot(pct1, ylim=c(0,1), main='proportion ct1')
barplot(pct2, ylim=c(0,1), main='proportion ct2')
barplot(pct3, ylim=c(0,1), main='proportion ct3')
pct <- cbind(pct1, pct2, pct3)
rownames(pct) <- rownames(spotpos)
## Visualize as pie charts
STdeconvolve::vizAllTopics(pct, spotpos) +
ggplot2::guides(colour = "none")
## assume 10 cells per spot
ncells <- 10
## make gene expression matrix
spotmat <- do.call(cbind, lapply(1:nrow(spotpos), function(i) {
  spotcells <- c(
    sample(ct1, pct1[i]*ncells),
    sample(ct2, pct2[i]*ncells),
    sample(ct3, pct3[i]*ncells)
  )
  rowSums(mat[,spotcells])
}))
colnames(spotmat) <- rownames(spotpos)
## simulated ST data
head(pct)
head(spotmat)
dim(spotmat)
head(spotpos)
dim(spotpos)
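## quick sanity check: before any tweak, per-spot library sizes should be roughly comparable
range(colSums(spotmat))
hist(colSums(spotmat), breaks=50, main='total counts per spot (untweaked)', xlab='total counts')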
#### tweaks to make normalization more prominent
tweak <- rownames(spotpos)[spotpos[,'y'] %in% seq(1,30,by=2)]
spotmattweak <- spotmat
spotmattweak[, tweak] <- spotmattweak[, tweak]*5
colSums(spotmattweak)
colSums(spotmat)
par(mfrow=c(1,1))
MERINGUE::plotEmbedding(spotpos, col=colSums(spotmattweak))
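## quick check: the tweaked spots should now have roughly 5x the library size of the untweaked spots
boxplot(list(tweaked = colSums(spotmattweak[, tweak]),
             untweaked = colSums(spotmattweak[, setdiff(colnames(spotmattweak), tweak)])),
        ylab = 'total counts per spot')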
#### exploring impact of normalization
## what happens if we don't normalize?
## dimensionality reduction with PCA
## and clustering with kmeans
pcs <- prcomp(t(spotmattweak))
plot(pcs$x[,1:2], pch=16)
com <- kmeans(pcs$x[,1:3], centers = 3)
MERINGUE::plotEmbedding(pcs$x[,1:2], groups=com$cluster)
MERINGUE::plotEmbedding(spotpos, groups=com$cluster)
MERINGUE::plotEmbedding(pcs$x[,1:2], col=colSums(spotmattweak))
MERINGUE::plotEmbedding(spotpos, col=colSums(spotmattweak))
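## compare the unnormalized clusters to the dominant simulated cell-type per spot;
## with the depth artifact present, the clusters should largely track library size rather than cell-type composition
truth <- apply(pct, 1, which.max)
table(com$cluster, truth)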
## what happens if we do normalize?
spotmattweaknorm <- t(t(spotmattweak)/colSums(spotmattweak))
spotmattweaknorm <- spotmattweaknorm*1e6
pcs2 <- prcomp(t(spotmattweaknorm))
plot(pcs2$x[,1:2], pch=16)
com2 <- kmeans(pcs2$x[,1:3], centers = 3)
MERINGUE::plotEmbedding(pcs2$x[,1:2], groups=com2$cluster)
MERINGUE::plotEmbedding(spotpos, groups=com2$cluster)
table(com$cluster, com2$cluster)
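## compare the normalized clusters to the dominant simulated cell-type per spot;
## after library-size normalization, the clusters should correspond much better to cell-type composition
table(com2$cluster, truth)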
## deconvolution analysis
library(STdeconvolve)
## no normalization
ldas <- fitLDA(t(as.matrix(spotmattweak)), Ks = 3)
optLDA <- optimalModel(models = ldas, opt = "min")
results <- getBetaTheta(optLDA, perc.filt = 0.05, betaScale = 1000)
deconProp <- results$theta
deconGexp <- results$beta
## visualize deconvolved cell-type proportions
vizAllTopics(deconProp, spotpos)
cor(pct, deconProp)
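## topic labels from LDA are arbitrary, so match each simulated cell-type to its best-correlated deconvolved topic
apply(cor(pct, deconProp), 1, which.max)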
## with normalization
## requires rounding hack to get counts
## (not recommended because as you'll see we get the same answer anyway)
ldas2 <- fitLDA(t(as.matrix(round(spotmattweaknorm))), Ks = 3)
optLDA2 <- optimalModel(models = ldas2, opt = "min")
results2 <- getBetaTheta(optLDA2, perc.filt = 0.05, betaScale = 1000)
deconProp2 <- results2$theta
deconGexp2 <- results2$beta
## visualize deconvolved cell-type proportions
vizAllTopics(deconProp2, spotpos)
cor(pct, deconProp2)
cor(deconProp, deconProp2)
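## topic labels may be permuted between the two runs; after matching on maximum correlation,
## each topic from the unnormalized fit should correlate near 1 with a topic from the normalized fit
apply(cor(deconProp, deconProp2), 1, max)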