Stability Testing
Stability testing: How do you know whether your single-cell clusters are ‘real’?
In single-cell RNA-seq analysis, we are often looking to identify transcriptional subpopulations that may be interpreted as distinct cell-types and subtypes or cell-states. For each cell, we measure its expression of thousands of genes, which we can use as features to cluster these cells into transcriptionally-similar clusters. But, one questions that keeps popping up in my mind:
How do you know whether your clusters are ‘real’? How do you know you haven’t ‘over-clustered’ your data?
In this blog post, I will show a few examples with both simulated and real data to highlight why we should care about this issue of over-clustering and provide a strategy to address over-clustering available in the MUDAN
package I am currently developing.
Simulated data
First, we will explore the issue of over-clustering and how to address over-clustering using simulated data. So let’s first simulate some data!
#' Simulate gene expression counts counts
#' @param G number of groups
#' @param N number of cells per group
#' @param M number of genes
#' @param initmean default mean gene expression
#' @param initvar default gene expression variance
#' @param upreg gene expression increase for upregulated genes
#' @param upregvar gene expression variance for upregulated genes
#' @param ng number of upregulated genes per group
simulate.data <- function(G=5, N=30, M=100, initmean=0, initvar=10, upreg=10, upregvar=10, ng=20, seed=0) {
set.seed(seed)
mat <- matrix(rnorm(N*M*G, initmean, initvar), M, N*G)
rownames(mat) <- paste0('gene', 1:M)
colnames(mat) <- paste0('cell', 1:(N*G))
group <- factor(sapply(1:G, function(x) rep(paste0('group', x), N)))
names(group) <- colnames(mat)
diff <- lapply(1:G, function(x) {
diff <- rownames(mat)[(((x-1)*ng)+1):(((x-1)*ng)+ng)]
mat[diff, group==paste0('group', x)] <<- mat[diff, group==paste0('group', x)] + rnorm(ng, upreg, upregvar)
return(diff)
})
names(diff) <- paste0('group', 1:G)
mat[mat<0] <- 0
mat <- round(mat)
return(list(cd=mat, group=group))
}
data <- simulate.data()
We can visualize our simulated data as a heatmap. Check out my previous post on highcharter
in case you are interested in how this plot is generated and embedded in this blog post. (You can also view the page source). Obviously, we simulated for and can see 5 distinct transcriptional subpopulations or clusters.
library(highcharter)
hc1 <- hchart(data$cd, "heatmap", hcaes(x = gene, y = data$group, value = value)) %>%
hc_colorAxis(stops = color_stops(10, rev(RColorBrewer::brewer.pal(10, "RdBu")))) %>%
hc_title(text = "Simulated Gene Expression") %>%
hc_legend(layout = "vertical", verticalAlign = "bottom", align = "right", valueDecimals = 0) %>%
hc_plotOptions(series = list(dataLabels = list(enabled = FALSE)))
Now, without peaking at the true underlying group annotations, we can to identify transcriptional subpopulations from this data. We generally need to perform some type of variance normalization and dimensionality reduction. However, since this simulated data does not have any real variance-magnitude dependencies in need of variance normalization to correct, and since its dimensions are already quite small, we will skip both the variance normalization and dimensionality reduction steps to jump right to the graph-based community detection with the infomap algorithm and 2D visualization with tSNE.
We demonstration purposes, we will severely over-cluster our data.
library(MUDAN)
# Graph-based community detection (way overcluster by using smallest possible k)
com <- getComMembership(t(data$cd), k=2, method=igraph::cluster_infomap)
# Get 2D tSNE embedding
emb <- Rtsne::Rtsne(t(data$cd), is_distance=FALSE, perplexity=30, verbose=FALSE)$Y
rownames(emb) <- colnames(data$cd)
We can compare our true group annotations from our simulation with our over-clustered annotations.
df <- data.frame(emb, data$group, rownames(emb)); colnames(df) <- c('x', 'y', 'group', 'name')
hc2 <- hchart(df, "scatter", hcaes(x, y, group=group, name=name)) %>%
hc_title(text = "True Group Annotations") %>%
hc_tooltip(headerFormat = "",
pointFormat = "<b>group</b>: {series.name} <br> <b>cell</b>: {point.name}")
df <- data.frame(emb, com, rownames(emb)); colnames(df) <- c('x', 'y', 'group', 'name')
hc3 <- hchart(df, "scatter", hcaes(x, y, group=group, name=name)) %>%
hc_title(text = "Over-clustered Group Annotations") %>%
hc_tooltip(headerFormat = "",
pointFormat = "<b>group</b>: {series.name} <br> <b>cell</b>: {point.name}")
Looking at our over-clustered results, we may be tempted to claim that we have 30 different transcriptional subpopulations in our dataset. Wow! Of course, here, we know there are truly only 5 subpopulations. But with real data, how can we tell?
Ask any biologist, and they will likely want to know:
What are the markers for each subpopulation? How can I FACs for it?
And indeed, it is this type of thinking that motivates our stability analysis. We can ask our data:
What are the differentially expressed genes marking each of our identified transcriptional subpopulations?
If there are no significantly differentially expressed genes between two transcriptional subpopulations, then we have no reason to believe that they are truly distinct. Therefore, they should be merged.
To perform this merging in a systematic manner, we will first contruct a dendrogram relating the transcriptional subpopulations. We will then traverse this dendrogram and perform recursive merges until all leaves of the dendrogram are transcriptionally distinct and stable. Specifically, we will require at least 5 significantly differentially expressed (Z>1.96) genes between each transcriptional subpopulations we compare.
#' Iterative merging of clusters along tree until stable
#' (TODO: get rid of awful use of globals I know)
#'
#' @param cd Counts matrix for differential expression analysis. Rows are genes. Columns are cells.
#' @param com Community/group annotations for cells
#' @param matnorm Normalized gene expression matrix for building group relationship tree. Rows are genes. Columns are cells.
#' @param z.threshold Z-score threshold for identifying significantly differentially expressed genes
#' @param hclust.method Hierarchical clustering method used to construct relationship tree
#' @param min.group.size Minimum group size for stable cluster
#' @param min.diff.genes Minimum number of significantly differentially expressed genes that must be identified or else groups will be merged
#' @param max.iter Maximum number of iterations. Will end earlier if convergence reached.
#' @param removeGenes Regex expression matching genes you want to remove such as ribosomal or mitochondrial genes.
#' @param plot Whether to plot intermediate plots (hierarchical clustering dendrograms)
#' @param verbose Verbosity
#'
getStableClusters <- function(cd, com, matnorm, z.threshold=3, hclust.method='ward.D', min.group.size=10, min.diff.genes=nrow(cd)*0.005, max.iter=10, removeGenes = NULL, verbose=TRUE, plot=FALSE) {
# get rid of very small groups
if(min.group.size>1) {
com[com %in% levels(com)[table(com)<min.group.size]] <- NA
}
com <- as.factor(com[colnames(cd)])
# helper function to compute differentially expressed genes between two branches of dendrogram
compare <- function(dend) {
# create branch annotations
g1 <- labels(dend[[1]])
g2 <- labels(dend[[2]])
incom <- names(com)[com %in% g1]
outcom <- names(com)[com %in% g2]
com.sub <- factor(c(
rep(paste0(g1, collapse="."), length(incom)),
rep(paste0(g2, collapse="."), length(outcom))))
names(com.sub) <- c(incom, outcom)
# get differentially expressed genes between the two branches
dg <- getDifferentialGenes(cd[, names(com.sub)], com.sub, verbose=verbose)
# filter for only significantly upregulated genes for each branch
dg.sig <- lapply(dg, function(x) {
x <- x[x$Z>z.threshold,]
x <- na.omit(x)
return(x)
})
if(!is.null(removeGenes)) {
dg.sig <- lapply(dg.sig, function(dg) {
dg <- dg[!grepl(removeGenes, rownames(dg)), ] ## exclude ribosomal and mitochondrial
})
}
return(dg.sig)
}
# recursively function to trace dendrogram
pv.recur <- function(dend) {
# compares leaves of tree (groups)
if(is.leaf(dend[[1]]) & is.leaf(dend[[2]])) {
g1 <- paste0(labels(dend[[1]]), collapse=":")
g2 <- paste0(labels(dend[[2]]), collapse=":")
if(verbose) {
print(paste0('Comparing ', g1, ' and ', g2))
}
# get differentially expressed genes between leaves
dg.sig <- compare(dend)
if(verbose) {
print('Differential genes found: ')
print(sapply(dg.sig, length))
}
# if insufficient number of marker genes distinguishing two leaves, merge
if(length(unlist(dg.sig)) < min.diff.genes) {
if(verbose) {
print(paste0('Merging ', g1, ' and ', g2))
}
com.fin[com.fin==g1] <<- paste0(g1,'.', g2)
com.fin[com.fin==g2] <<- paste0(g1,'.', g2)
}
pv.sig.all[[g1]] <<- dg.sig
pv.sig.all[[g2]] <<- dg.sig
} else {
# if only one is leaf, compare to entire other branch
if(is.leaf(dend[[1]])) {
g1 <- paste0(labels(dend[[1]]), collapse=".")
g2 <- paste0(labels(dend[[2]]), collapse=".")
if(verbose) {
print(paste0('Comparing ', g1, ' and ', g2))
}
dg.sig <- compare(dend)
if(verbose) {
print('Differential genes found: ')
print(sapply(dg.sig, nrow))
}
# if insufficient number of marker genes for leaf, set to NA
if(nrow(dg.sig[[labels(dend[[1]])]]) < min.diff.genes) {
if(verbose) {
print(paste0('Cannot distinguish ', g1, ' and ', g2))
}
com.fin[com.fin==g1] <<- NA
}
pv.sig.all[[g1]] <<- dg.sig
} else {
# continue tree traversal
pv.recur(dend[[1]])
}
if(is.leaf(dend[[2]])) {
g1 <- paste0(labels(dend[[1]]), collapse=".")
g2 <- paste0(labels(dend[[2]]), collapse=".")
if(verbose) {
print(paste0('Comparing ', g1, ' and ', g2))
}
dg.sig <- compare(dend)
if(verbose) {
print('Differential genes found: ')
print(sapply(dg.sig, nrow))
}
if(nrow(dg.sig[[labels(dend[[2]])]]) < min.diff.genes) {
if(verbose) {
print(paste0('Cannot distinguish ', g1, ' and ', g2))
}
com.fin[com.fin==g2] <<- NA
}
pv.sig.all[[g2]] <<- dg.sig
} else {
# continue tree traversal
pv.recur(dend[[2]])
}
}
}
i <- 1 # counter
repeat {
com <- factor(com)
# average within groups
mat.summary <- do.call(cbind, lapply(levels(com), function(ct) {
cells <- which(com==ct)
if(length(cells) > 1) {
Matrix::rowMeans(matnorm[, cells])
} else {
matnorm[,cells]
}
}))
colnames(mat.summary) <- levels(com)
# cluster groups
hc <- hclust(dist(t(mat.summary)), method=hclust.method)
if(plot) { plot(hc) }
dend <- as.dendrogram(hc)
# globals
com.fin <- as.character(com)
names(com.fin) <- names(com)
pv.sig.all <- list()
# traverse tree by recursion
pv.recur(dend)
# test if converged
if(sum(com==com.fin, na.rm=TRUE)>=min(sum(!is.na(com)), sum(!is.na(com.fin)))) {
# compute one last time
com.fin <- factor(com.fin)
mat.summary <- do.call(cbind, lapply(levels(com.fin), function(ct) {
cells <- which(com.fin==ct)
if(length(cells) > 1) {
Matrix::rowMeans(matnorm[, cells])
} else {
matnorm[,cells]
}
}))
colnames(mat.summary) <- levels(com.fin)
hc <- hclust(dist(t(mat.summary)), method=hclust.method)
if(plot) { plot(hc) }
dend <- as.dendrogram(hc)
## exit
break
}
if(i>max.iter) {
break
}
com <- com.fin
i <- i+1
}
return(list(com=com.fin, pv.sig=pv.sig.all, hc=hc, mat.summary=mat.summary))
}
Now let’s actually apply this function to our over-clustered simulated data.
# Get stable clusters
comA <- getStableClusters(data$cd, com, data$cd, z.threshold=1.96, min.group.size=1, min.diff.genes=5, plot=FALSE, verbose=FALSE)
# Annotate cells that were from unstable clusters with stable annotations
# You can also do this with nearest neighbor voting or just a different classifier
# TODO: do this within getStableClusters ie. when leaf is not stable compared to branch, just label unstable cells with most similar leaf in branch
pv.sig.genes <- unique(unlist(lapply(comA$pv.sig, function(x) {
unique(unlist(lapply(x, rownames)))
})))
com.sub <- na.omit(comA$com)
df.sub <- data.frame(celltype=com.sub, t(data$cd[pv.sig.genes, names(com.sub)]))
model <- MASS::lda(celltype ~ ., data=df.sub)
model.output <- predict(model, data.frame(t(data$cd)))
com.all.fin <- model.output$class
names(com.all.fin) <- colnames(data$cd)
df <- data.frame(emb, com.all.fin, rownames(emb)); colnames(df) <- c('x', 'y', 'group', 'name')
hc4 <- hchart(df, "scatter", hcaes(x, y, group=group, name=name)) %>%
hc_title(text = "Stable Group Annotations") %>%
hc_tooltip(headerFormat = "",
pointFormat = "<b>group</b>: {series.name} <br> <b>cell</b>: {point.name}")
This looks much better! Much more like our simulated ground-truth! Sure, you may think that in this particular dummy example, you could’ve just looked at the tSNE embedding and tell that we’ve overclustered the data. But with real data, things can often be a lot more ambiguous.
10X PBMC data from Patient B
So let’s test out some real data. We will subsample 1000 cells from the PBMC dataset from 10X Genomics. It is provided as a part of the MUDAN
package.
library(MUDAN)
data(pbmcB)
myMudanObject <- Mudan$new('pbmcB', pbmcB[, 1:1000], verbose=FALSE)
myMudanObject$normalizeCounts() # CPM normalization
myMudanObject$normalizeVariance() # variance normalization
myMudanObject$getPcs(nPcs=50, maxit=1000) # dimensionality reduction
myMudanObject$getStandardEmbedding() # tSNE embedding
# graph-based community detection
# overcluster with small k
myMudanObject$getComMembership(communityName='Infomap',
communityMethod=igraph::cluster_infomap, k=8)
# get transcriptionally stable clusters only
# ignore ribosomal and mitochondrial genes
myMudanObject$getStableClusters(communityName='Infomap',
removeGenes='^RP|^MT',
min.diff.genes=5, z.threshold=1.96)
df <- data.frame(myMudanObject$emb$PCA,
myMudanObject$com$Infomap$ALL,
rownames(myMudanObject$emb$PCA))
colnames(df) <- c('x', 'y', 'group', 'name')
hc5 <- hchart(df, "scatter", hcaes(x, y, group=group, name=name)) %>%
hc_title(text = "Over-clustered Group Annotations") %>%
hc_tooltip(headerFormat = "",
pointFormat = "<b>group</b>: {series.name} <br> <b>cell</b>: {point.name}")
df <- data.frame(myMudanObject$emb$PCA,
myMudanObject$com$Infomap$STABLE,
rownames(myMudanObject$emb$PCA))
colnames(df) <- c('x', 'y', 'group', 'name')
hc6 <- hchart(df, "scatter", hcaes(x, y, group=group, name=name)) %>%
hc_title(text = "Stable Group Annotations") %>%
hc_tooltip(headerFormat = "",
pointFormat = "<b>group</b>: {series.name} <br> <b>cell</b>: {point.name}")
We can also see what differentially expressed genes / markers were identified for each of our stable subpopulations.
## average within population
hc <- myMudanObject$hc$Infomap
mn <- myMudanObject$mat.summary$Infomap[,hc$labels[hc$order]]
# plot significantly differentially expressed genes only
pv.sig <- myMudanObject$pv.sig$Infomap[hc$labels[hc$order]]
pv.sig.genes <- unique(unlist(lapply(names(pv.sig), function(ct) {
rownames(pv.sig[[ct]][[ct]]) # upregulated only in group of interest
})))
mn <- mn[pv.sig.genes,]
mn <- t(scale(t(mn)))
hc7 <- hchart(mn, "heatmap", hcaes(x = gene, y = data$group, value = value)) %>%
hc_colorAxis(stops = color_stops(10, rev(RColorBrewer::brewer.pal(10, "RdBu")))) %>%
hc_title(text = "Differentially Expressed Genes") %>%
hc_legend(layout = "vertical", verticalAlign = "bottom", align = "right", valueDecimals = 0) %>%
hc_plotOptions(series = list(dataLabels = list(enabled = FALSE)))
Discussion
In this blog post, we saw how over-clustering may lead us to believe that there are more transcriptionally disinct subpopulations in our data than in reality. We used a recursive differential gene expression analysis-based tree-traversal to identify stable transcriptional subpopulations, under the assumption that true transcriptional subpopulations must be sufficiently transcriptionally distinct from other true transcriptional subpopulations.
To define “sufficiently transcriptionally disinct”, we required a minimum of 5 significantly differentially expressed genes (with significance as abs(Z-score) > 1.96 ie. p-value < 0.05). However, is this a reasonable criteria? Why should we require 5 genes with absolute Z-scores > 1.96 and not 10 gene with Z-score greater than 3? Or maybe a different statistic altogether? Should the genes all necessarily impact different pathways? What about power calculations? Should the number and threshold for significantly differentially expressed genes vary depending on the number of cells in the subpopulations?
And of course, getting back to the big picture question of whether a single-cell cluster is ‘real’, just because we fail to find significantly differentially expressed genes, doesn’t always mean these clusters do not exhibit phenotypic or functional differences. Perhaps we were underpowered to identify significant differences. Perhaps there are other post-transcriptional modifications or spatial localization patterns that we are unaware of that could distinguish these cell subtypes. But as we are dealing with transcriptome measurements, we have to make do with what these measurements tell us.
If you have any ideas and are interesting in tinkering, feel free to make a fork or post a suggestion on the MUDAN
package.
- Older
- Newer
Recent Posts
- Using AI to find heterogeneous scientific speakers on 04 November 2024
- The many ways to calculate Moran's I for identifying spatially variable genes in spatial transcriptomics data on 29 August 2024
- Characterizing spatial heterogeneity using spatial bootstrapping with SEraster on 23 July 2024
- I use R to (try to) figure out which hospital I should go to for shoppable medical services by comparing costs through analyzing Hospital Price Transparency data on 22 April 2024
- Cross modality image alignment at single cell resolution with STalign on 11 April 2024
Related Posts
- Using AI to find heterogeneous scientific speakers
- The many ways to calculate Moran's I for identifying spatially variable genes in spatial transcriptomics data
- Characterizing spatial heterogeneity using spatial bootstrapping with SEraster
- I use R to (try to) figure out which hospital I should go to for shoppable medical services by comparing costs through analyzing Hospital Price Transparency data