Stability Testing
Feb 28, 2018
Stability testing: How do you know whether your single-cell clusters are ‘real’?
In single-cell RNA-seq analysis, we are often looking to identify transcriptional subpopulations that may be interpreted as distinct cell-types and subtypes or cell-states. For each cell, we measure its expression of thousands of genes, which we can use as features to cluster these cells into transcriptionally-similar clusters. But, one questions that keeps popping up in my mind:
How do you know whether your clusters are ‘real’? How do you know you haven’t ‘over-clustered’ your data?
In this blog post, I will show a few examples with both simulated and real data to highlight why we should care about this issue of over-clustering and provide a strategy to address over-clustering available in the MUDAN
package I am currently developing.
Simulated data
First, we will explore the issue of over-clustering and how to address over-clustering using simulated data. So let’s first simulate some data!
#' Simulate gene expression counts counts
#' @param G number of groups
#' @param N number of cells per group
#' @param M number of genes
#' @param initmean default mean gene expression
#' @param initvar default gene expression variance
#' @param upreg gene expression increase for upregulated genes
#' @param upregvar gene expression variance for upregulated genes
#' @param ng number of upregulated genes per group
simulate.data <- function(G=5, N=30, M=100, initmean=0, initvar=10, upreg=10, upregvar=10, ng=20, seed=0) {
set.seed(seed)
mat <- matrix(rnorm(N*M*G, initmean, initvar), M, N*G)
rownames(mat) <- paste0('gene', 1:M)
colnames(mat) <- paste0('cell', 1:(N*G))
group <- factor(sapply(1:G, function(x) rep(paste0('group', x), N)))
names(group) <- colnames(mat)
diff <- lapply(1:G, function(x) {
diff <- rownames(mat)[(((x-1)*ng)+1):(((x-1)*ng)+ng)]
mat[diff, group==paste0('group', x)] <<- mat[diff, group==paste0('group', x)] + rnorm(ng, upreg, upregvar)
return(diff)
})
names(diff) <- paste0('group', 1:G)
mat[mat<0] <- 0
mat <- round(mat)
return(list(cd=mat, group=group))
}
data <- simulate.data()
We can visualize our simulated data as a heatmap. Check out my previous post on highcharter
in case you are interested in how this plot is generated and embedded in this blog post. (You can also view the page source). Obviously, we simulated for and can see 5 distinct transcriptional subpopulations or clusters.
library(highcharter)
hc1 <- hchart(data$cd, "heatmap", hcaes(x = gene, y = data$group, value = value)) %>%
hc_colorAxis(stops = color_stops(10, rev(RColorBrewer::brewer.pal(10, "RdBu")))) %>%
hc_title(text = "Simulated Gene Expression") %>%
hc_legend(layout = "vertical", verticalAlign = "bottom", align = "right", valueDecimals = 0) %>%
hc_plotOptions(series = list(dataLabels = list(enabled = FALSE)))