Jun 28, 2018

Explore more posts
by Tags | by Date | View All

Keep up to date

Single Cell Clustering Comparison

So you aligned and quantified your single cell RNA-seq data. You QC-ed, normalized, and even batch corrected as needed. Now you want to identify transcriptional clusters. What clustering algorithm do you use?

Many specialized clustering methods for single cell RNA-seq data have been developed. Most come with sophisticated error modeling and other innovations. Still, they are generally grounded in some type of general k-means, graph-based, density-based, or hierarchical clustering.

So in this blog post, I will test out a few different general apporaches for identifying clusters in single cell RNA-seq data, namely: k-means, graph-based community detection, dbscan, and hierachical clustering. For demonstration, we will use the Patient B PBMC single cell RNA-seq data from 10X. PBMCs are nice because we have an expectation of what cell-types must be present and we know a number of markers for these cell-types (CD3E for T-cells, CD19 and CD20 for B-cells, etc). We can then compare the different clusters identified by various clustering approaches to what we know must be a cluster based on our knowledge of these cell markers.

Preparing the data

Let’s first quickly clean, normalize, and perform some dimensionality reduction on our PBMCs to get it into shape for clustering detection. We can then visualize a few of these known cell markers.

set.seed(0)

############# Sample data
library(MUDAN)
data(pbmcB)
## filter out poor genes and cells
cd <- cleanCounts(pbmcB, 
                  min.reads = 10, 
                  min.detected = 10, 
                  verbose=FALSE)
## CPM normalization
mat <- normalizeCounts(cd, 
                       verbose=FALSE) 
## variance normalize, identify overdispersed genes
matnorm.info <- normalizeVariance(mat, 
                                  details=TRUE, 
                                  verbose=FALSE,
                                  alpha=0.2) 
## log transform
matnorm <- log10(matnorm.info$mat+1) 
## dimensionality reduction on overdispersed genes
pcs <- getPcs(matnorm[matnorm.info$ods,], 
              nGenes=length(matnorm.info$ods), 
              nPcs=30, 
              verbose=FALSE) 
## get tSNE embedding 
emb <- Rtsne::Rtsne(pcs, 
                    is_distance=FALSE, 
                    perplexity=30, 
                    num_threads=parallel::detectCores(), 
                    verbose=FALSE)$Y 
rownames(emb) <- rownames(pcs)

Based on our prior knowledge of marker genes, we can already tell that there must be distinct clusters corresponding to B-cells, T-cells subtypes, monocytes, and so forth. Indeed, we can see that CD20 (MS4A1) expression marks a distinct group of cells, presumably B-cells. And similarly, CD3E marks our T-cells, and so forth.

par(mfrow=c(3,3), mar=rep(2,4))
markers <- c('MS4A1', 'CD3E', 'IL7R', 'CCR7', 'CD8A', 'FCGR3A', 'CD14', 'HLA-DRA', "XCL1")
invisible(lapply(markers, function(g) {
  # plot binarized expression (on or off)
  plotEmbedding(emb, colors=(cd[g,]>0)*1, 
              main=g, xlab=NA, ylab=NA, 
              verbose=FALSE) 
}))

We can now try various cluster detection approaches and see if the identified clusters match what we expect based on marker expression.

K-Means

First, we will use a simple k-means clustering approach on our PCs. With k-means clustering, a k must be provided a-priori to partition your cells into k groups. Thus, the results rely heavily on your choice of k. To be extra generous in our comparison, I tried a number of ks and picked the best (as least visually) one to plot.

## k-means
set.seed(0)
com.km <- kmeans(pcs, centers=10)$cluster
par(mfrow=c(1,1), mar=rep(2,4))
plotEmbedding(emb, com.km, 
              main='K-Means', xlab=NA, ylab=NA, 
              mark.clusters=TRUE, alpha=0.1, mark.cluster.cex=1,
              verbose=FALSE) 

Graph-based community detection

Next, we will build a graph based on the nearest neighbor relationships in PC space. We can then apply graph-based community detection, specifically the Infomap method, to identify putative clusters. There are a number of other graph-based community detection algorithms you can use (Louvain, Walktrap, etc) but all will require building a graph. Here, I elected to build a graph from the 30 nearest neighbors in PC space for each cell.

## graph-based community detection
set.seed(0)
com.graph <- MUDAN::getComMembership(pcs, 
                        k=30, method=igraph::cluster_infomap, 
                        verbose=FALSE) 
par(mfrow=c(1,1), mar=rep(2,4))
plotEmbedding(emb, com.graph, 
              main='Graph-based Community Detection', xlab=NA, ylab=NA, 
              mark.clusters=TRUE, alpha=0.1, mark.cluster.cex=1,
              verbose=FALSE) 

HDBSCAN

DBSCAN (or in this case, the hierarchical version) is nice in that it doesn’t require any expected number of clusters, such as k-mean’s k parameter, to be specified a priori. But I had a hard time getting sensible results using all PCs (too high dimensional) so I tried to reduce the dimensions considered to only PCS 1 to 5.

## Hierarchical DBSCAN
library(dbscan)
set.seed(0)
com.db <- hdbscan(pcs[,1:5], minPts=5)$cluster
names(com.db) <- rownames(pcs)
par(mfrow=c(1,1), mar=rep(2,4))
plotEmbedding(emb, com.db, 
              main='HDBSCAN', xlab=NA, ylab=NA, 
              mark.clusters=TRUE, alpha=0.1, mark.cluster.cex=1,
              verbose=FALSE)

Hierarchical clustering

Hierarchical clustering or, alternatively biclustering (not shown) that clusters both cells and genes, builds tree relationship among cells that can then be cut at various heights to form clusters. Rather than choosing a cut height, I will use dynamic tree cutting. Of course, the choice of distance metric and agglomeration method for hierarchical clustering will still impact results. Here, I am using a 1 minus correlation in PC space to define the distance between cells and the Ward agglomeration method.

## Hierarchical clustering with dynamic tree cutting
library(dynamicTreeCut)
set.seed(0)
d <- as.dist(1-cor(t(pcs)))
hc <- hclust(d, method='ward.D')
com.hc <- cutreeDynamic(hc, distM=as.matrix(d), deepSplit=4, verbose=FALSE)
names(com.hc) <- rownames(pcs)
par(mfrow=c(1,1), mar=rep(2,4))
plotEmbedding(emb, com.hc, 
              main='Hclust', xlab=NA, ylab=NA, 
              mark.clusters=TRUE, alpha=0.1, mark.cluster.cex=1,
              verbose=FALSE) 

In my own work, I am reaching the point where I must now consider not only the accuracy of a clustering algorithm but also its runtime and memory usage (primarily during software testing and development where I really value being able to iterate quickly). So we will also benchmark the runtime and memory usage of each cluster detection approach to get a sense of how well each method may scale as our datasets increase in size.

First, let’s benchmark runtime.

## try for different number of cells
ncells <- c(100, 200, 400, 800, 1600, 3200, 6400)

## k-means
rt.km <- sapply(ncells, function(vi) {
  start_time <- Sys.time()
  kmeans(pcs[1:vi,], centers=10)$cluster
  end_time <- Sys.time()
  return(end_time - start_time)
})

## graph-based community detection
rt.graph <- sapply(ncells, function(vi) {
  start_time <- Sys.time()
  getComMembership(pcs[1:vi,], 
                   k=30, method=igraph::cluster_infomap, 
                   verbose=FALSE) 
  end_time <- Sys.time()
  return(end_time - start_time)
})

## HDBSCAN
rt.db <- sapply(ncells, function(vi) {
  start_time <- Sys.time()
  hdbscan(pcs[1:vi,1:5], minPts=10)$cluster
  end_time <- Sys.time()
  return(end_time - start_time)
})

## Hclust w/ dynamic tree cutting
library(dynamicTreeCut)
rt.hc <- sapply(ncells, function(vi) {
  start_time <- Sys.time()
  d <- as.dist(1-cor(t(pcs[1:vi,])))
  hc <- hclust(d, method='ward.D')
  com.hc <- cutreeDynamic(hc, distM=as.matrix(d), deepSplit=4, verbose=FALSE)
  names(com.hc) <- rownames(pcs)[1:vi]
  end_time <- Sys.time()
  return(end_time - start_time)
})

We will make an interactive plot here using highcharter.

############# Highercharter
library(highcharter)
# Adapted from highcharter::export_hc() to not write to file
library(jsonlite)
library(stringr)
write_hc <- function(hc, name) {
  JS_to_json <- function(x) {
    class(x) <- "json"
    return(x)
  }
  hc$x$hc_opts <- rapply(object = hc$x$hc_opts, f = JS_to_json, 
                         classes = "JS_EVAL", how = "replace")
  js <- toJSON(x = hc$x$hc_opts, pretty = TRUE, auto_unbox = TRUE, 
               json_verbatim = TRUE, force = TRUE, null = "null", na = "null")
  js <- sprintf("$(function(){\n\t$('#%s').highcharts(\n%s\n);\n});", 
                name, js)
  return(js)
}
# Helper function to write javascript in Rmd
# Thanks so http://livefreeordichotomize.com/2017/01/24/custom-javascript-visualizations-in-rmarkdown/
send_hc_to_js <- function(hc, hcid){
  cat(
    paste(
      '<span id=\"', hcid, '\"></span>',
      '<script>',
      write_hc(hc, hcid),
      '</script>', 
      sep="")
  )
}

# Melt into appropriate format
df <- data.frame('graph'=rt.graph,
                 'km'=rt.km,
                 'db'=rt.db,
                 'hc'=rt.hc
                 )
dfm <- reshape2::melt(df)
dfm$ncell <- rep(ncells, 4)

# Plot
hcRT <- highchart() %>% 
  hc_add_series(dfm, "line", hcaes(x = ncell, y = value, group = variable)) %>%
  hc_title(text = 'Runtime Comparison') %>%
  hc_legend(align = "right", layout = "vertical") %>%
  hc_xAxis(title = list(text = 'number of cells')) %>%
  hc_yAxis(title = list(text = 'runtime in seconds'))

send_hc_to_js(hcRT, 'hcRT')

K-means is quite fast! Graph-based community detection, which requires building a graph, is currently somewhat slow, though that may just be my poor implementation of the graph-construction approach from nearest neighbors. Similarly for hierarchical clustering, there may be faster C-based algorithms for calculating the distance matrices that could speed things up.

We can do the same for memory usage.

library(pryr)
## graph-based community detection
mem.graph <- sapply(ncells, function(vi) {
  return(mem_change(
  com <- getComMembership(pcs[1:vi,], 
                   k=30, method=igraph::cluster_infomap, 
                   verbose=FALSE) 
  ))
})

## k-means
mem.km <- sapply(ncells, function(vi) {
  return(mem_change(
  com <- kmeans(pcs[1:vi,], centers=10)$cluster
  ))
})

## HDBSCAN
mem.db <- sapply(ncells, function(vi) {
  return(mem_change(
  com <- hdbscan(pcs[1:vi,1:5], minPts=5)$cluster
  ))
})

## Hclust w/ dynamic tree cutting
mem.hc <- sapply(ncells, function(vi) {
  return(
    mem_change(d <- as.dist(1-cor(t(pcs[1:vi,])))) +
    mem_change(hc <- hclust(d, method='ward.D')) +
    mem_change(com <- cutreeDynamic(hc, distM=as.matrix(d), deepSplit=4, verbose=FALSE))
  )
})

# Melt into appropriate format
df <- data.frame('graph'=mem.graph,
                 'km'=mem.km,
                 'db'=mem.db,
                 'hc'=mem.hc
                 )
dfm <- reshape2::melt(df)
dfm$ncell <- rep(ncells, 4)

# Plot
hcMem <- highchart() %>% 
  hc_add_series(dfm, "line", hcaes(x = ncell, y = value, group = variable)) %>%
  hc_title(text = 'Memory Comparison') %>%
  hc_legend(align = "right", layout = "vertical") %>%
  hc_xAxis(title = list(text = 'number of cells')) %>%
  hc_yAxis(title = list(text = 'memory usage (log)'), type = "logarithmic")

send_hc_to_js(hcMem, 'hcMem')

Note that the y-axis is on the log scale! You definitely wouldn’t want to use hierachical clustering if you have millions of cells!

Conclusion

Judge for yourself! Which method did ‘better’? To really judge, we would want to check which method is better at identifying ‘stable’ clusters with truely differentially expressed genes. Only by looking for differentially expressed genes distinguishing each cluster would we be able to clarify which method properly split up the monocytes (marked by CD14) into putative subtypes. While k-means split monocytes into groups annotated 4, 6, and 7, graph-based community detection split into groups annotated as 4 and 5 which are not consistent with k-means, while the other methods did not identify any subtypes. DBSCAN appears to be better suited for identifying super small/rare populations such as groups marked 1, 8, and 9 as subtypes of B-cells (marked by CD20/MS4A1) but fails to distinguish between the large known T-cell subtypes (CD4, CD8 for example) at least using current settings. Hiearchical clustering is just too slow and memory intensive to be worthwhile in my opinion moving forward unless cells are downsampled or merged prior to clustering to reduce N.

What happens if we use different ks in our various cluster detection approaches? Are some approaches more sensitive to parameter choices than others? What if instead of PC space, we do our community detection in the original expression space (using the entire normalized expression matrix instead of PCs)? Try it out for yourself!