Graph Based Community Detection For Clustering Analysis
Sep 13, 2017
Graph-based community detection for clustering analysis in R
Introduction
In single cell analyses, we are often trying to identify groups of transcriptionally similar cells, which we may interpret as distinct cell types or cell states. We may also be interested in identifying groups of transcriptionally coordinated genes, which we may interpret as functional regulatory modules or pathways. In either case, we are looking at some high dimensional data and trying to identify clusters.
We can also think of these clusters as communities in a graph. Abstractly, our graph would just be composed of all our cells as vertices. And the vertices would, in theory, be connected to each other if the cells were of the same cell type. Community detection in graphs identify groups of vertices with higher probability of being connected to each other than to members of other groups.
In this tutorial, I will use simulated and public data to demonstrate how you can apply graph-based community detection to identify cell types.
Simulation
First, let’s simulate some data. We will simulate 5 very clearly distinct cell types and see if we can use graph-based community detection to correctly recover the true cell type annotations.
G=5 ## number of cell types
N=100 ## number of cells per cell type
M=1000 ## number of genes
initmean=0 ## baseline mean expression
initvar=10 ## baseline expression variance
upreg=10 ## degree of cell-type specific upregulation
upregvar=10 ## expression variance of upregulated genes
ng=100 ## number of upregulated genes per cell type
seed=0 ## random seed
set.seed(seed)
mat <- matrix(rnorm(N*M*G, initmean, initvar), M, N*G)
rownames(mat) <- paste0('gene', 1:M)
colnames(mat) <- paste0('cell', 1:(N*G))
group <- factor(sapply(1:G, function(x) rep(paste0('group', x), N)))
names(group) <- colnames(mat)
diff <- lapply(1:G, function(x) {
diff <- rownames(mat)[(((x-1)*ng)+1):(((x-1)*ng)+ng)]
mat[diff, group==paste0('group', x)] <<- mat[diff, group==paste0('group', x)] + rnorm(ng, upreg, upregvar)
return(diff)
})
names(diff) <- paste0('group', 1:G)
mat[mat<0] <- 0
mat <- t(t(mat)/colSums(mat))
mat <- log10(mat*1e6+1)
heatmap(mat, Rowv=NA, Colv=NA, col=colorRampPalette(c('blue', 'white', 'red'))(100), scale="none", ColSideColors=rainbow(G)[group], labCol=FALSE, labRow=FALSE)
To apply graph-based community detection algorithms, we will need to represent our data as a graph. To do this, we will use the RANN
to identify approximate nearest neighbors.
library(RANN)
knn.info <- RANN::nn2(t(mat), k=30)
The result is a list containing a matrix of neighbor relations and another matrix of distances. We will convert this neighbor relation matrix into an adjacency matrix.
knn <- knn.info$nn.idx
adj <- matrix(0, ncol(mat), ncol(mat))
rownames(adj) <- colnames(adj) <- colnames(mat)
for(i in seq_len(ncol(mat))) {
adj[i,colnames(mat)[knn[i,]]] <- 1
}
Now, we can represent the adjacancy matrix as a graph.
library(igraph)
g <- igraph::graph.adjacency(adj, mode="undirected")
g <- simplify(g) ## remove self loops
V(g)$color <- rainbow(G)[group[names(V(g))]] ## color nodes by group
plot(g, vertex.label=NA)
In visualizing the graph, we can clearly see 5 very distinct communities corresponding to our simulated cell types. If we can see these distinct communities, surely a graph-based community detection method such as the walktrap algorithm can detect them too.
km <- igraph::cluster_walktrap(g)
## community membership
com <- km$membership
names(com) <- km$names
table(com, group)
## group
## com group1 group2 group3 group4 group5
## 1 100 0 0 0 0
## 2 0 100 0 0 0
## 3 0 0 0 0 100
## 4 0 0 0 100 0
## 5 0 0 100 0 0
Do our identified communities match our simulated groups? Indeed they do! Community 1 corresponds to our group1 of cells, community 2 to our group2, community 3 to our group5, and so forth. However, this is a perfect simulated world. What happens when we apply this to real world data?
10X Immune Cells
Consider this public dataset from 10X genomics.
library(Matrix)
load('reference_10x.RData')
dim(reference.cd)
## [1] 32738 2140
table(ct)
## ct
## bcells cytotoxict memoryt monocytes naivecytotoxic
## 330 148 252 129 43
## naivet nk regulatoryt thelper
## 52 769 226 191
Let’s clean up this data a little. We’ll remove genes seen in fewer than 100 cells (a bit stringent, but it’ll speed things up for this demonstration). Then we’ll normalize and log transform.
vi <- Matrix::rowSums(reference.cd>0)>100
table(vi)
## vi
## FALSE TRUE
## 27704 5034
mat <- reference.cd[vi,]
mat <- t(t(mat)/colSums(mat))
mat <- log10(mat*1e6+1)
mat[1:5,1:5]
## 5 x 5 Matrix of class "dgeMatrix"
## bcellsAAAGATCTCCGAAT-1 bcellsAAAGCAGAGTACGT-1
## NOC2L 0.00000 0
## ISG15 2.43014 0
## TNFRSF18 2.43014 0
## TNFRSF4 0.00000 0
## SDF4 0.00000 0
## bcellsAAATCTGACCCTTG-1 bcellsAACAGCACTGAACC-1
## NOC2L 0.000000 0.000000
## ISG15 2.461924 2.432584
## TNFRSF18 0.000000 2.432584
## TNFRSF4 0.000000 0.000000
## SDF4 2.461924 2.432584
## bcellsAACATTGATGCTTT-1
## NOC2L 2.799293
## ISG15 3.099978
## TNFRSF18 0.000000
## TNFRSF4 0.000000
## SDF4 0.000000
Now let’s run our graph-based community detection to see if we recover the true cell type annotations.
## identify KNN
library(RANN)
knn.info <- RANN::nn2(t(mat), k=30)
## convert to adjacancy matrix
knn <- knn.info$nn.idx
adj <- matrix(0, ncol(mat), ncol(mat))
rownames(adj) <- colnames(adj) <- colnames(mat)
for(i in seq_len(ncol(mat))) {
adj[i,colnames(mat)[knn[i,]]] <- 1
}
## convert to graph
library(igraph)
g <- igraph::graph.adjacency(adj, mode="undirected")
g <- simplify(g) ## remove self loops
## identify communities
km <- igraph::cluster_walktrap(g)
com <- km$membership
names(com) <- km$names
## compare to annotations
table(com, ct)
## ct
## com bcells cytotoxict memoryt monocytes naivecytotoxic naivet nk
## 1 0 5 20 1 0 1 0
## 2 0 0 1 110 0 2 0
## 3 330 0 0 17 1 6 0
## 4 0 137 231 1 42 43 3
## 5 0 6 0 0 0 0 766
## ct
## com regulatoryt thelper
## 1 16 15
## 2 3 8
## 3 0 0
## 4 207 168
## 5 0 0
Not perfect but not bad! Community 3 is predominantly b cells, community 4 is cytotoxic t cells, community 2 is monocytes, and so forth. There are lots of other community detection algorithms as well that we can try like infomap.
km <- igraph::cluster_infomap(g)
com <- km$membership
names(com) <- km$names
table(com, ct)
## ct
## com bcells cytotoxict memoryt monocytes naivecytotoxic naivet nk
## 1 0 140 217 0 42 34 3
## 2 0 3 0 0 0 0 766
## 3 330 0 0 2 0 0 0
## 4 0 0 1 110 0 2 0
## 5 0 5 20 0 0 0 0
## 6 0 0 14 0 0 9 0
## 7 0 0 0 17 1 7 0
## ct
## com regulatoryt thelper
## 1 208 165
## 2 0 0
## 3 0 0
## 4 3 8
## 5 14 13
## 6 1 4
## 7 0 1
And many many more. Check out the igraph
package for more community detection algorithms available through the package: https://www.rdocumentation.org/packages/igraph/
Now it’s time for you to try it out for yourself. What if we change k in the k-nearest neighbor identification? What if we use fewer features/genes? What if we try a different community detection algorithm? What if we transpose our matrices and try to cluster genes instead of cells?
- Older
- Newer
RECENT POSTS
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve on 04 May 2023
- Aligning Spatial Transcriptomics Data With Stalign on 16 April 2023
- 3D animation of the brain in R on 08 November 2022
- Ethical Challenges in Biomedical Engineering - Data Collection, Analysis, and Interpretation on 15 October 2022
- I use R to (try to) figure out the cost of medical procedures by analyzing insurance data from the Transparency in Coverage Final Rule on 12 September 2022
- Annotating STdeconvolve Cell-Types with ASCT+B Tables on 30 August 2022
- Deconvolution vs Clustering Analysis: An exploration via simulation on 11 July 2022
- Coloring SVGs in R on 17 June 2022
- Deconvolution vs Clustering Analysis for Multi-cellular Pixel-Resolution Spatially Resolved Transcriptomics Data on 03 May 2022
- Exploring UMAP parameters in visualizing single-cell spatially resolved transcriptomics data on 19 January 2022
TAGS
RELATED POSTS
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve
- Aligning Spatial Transcriptomics Data With Stalign
- 3D animation of the brain in R
- I use R to (try to) figure out the cost of medical procedures by analyzing insurance data from the Transparency in Coverage Final Rule