Visually Summarizing Differential Gene Expression
Apr 25, 2018
Summarizing differential gene expression in large single cell datasets
So you’re analyzing a large single cell dataset. After you identify transcriptionally distinct clusters, you identify a set of differentially expressed genes marking each cluster. There could be hundreds of significantly upregulated genes in each cluster and hundreds of cells per cluster. How do you visually peruse through all this data?
library(MUDAN)
## load built in 10X pbmcA dataset
data(pbmcA)
pbmcA <- as.matrix(pbmcA)
cd <- cleanCounts(pbmcA, min.reads = 10, min.detected = 10, verbose=FALSE)
mat <- normalizeCounts(cd, verbose=FALSE)
matnorm.info <- normalizeVariance(mat, details=TRUE, verbose=FALSE)
matnorm <- log10(matnorm.info$mat+1)
pcs <- getPcs(matnorm[matnorm.info$ods,], nGenes=length(matnorm.info$ods), nPcs=30, verbose=FALSE)
d <- dist(pcs, method='man')
emb <- Rtsne::Rtsne(d, is_distance=TRUE, perplexity=50, num_threads=parallel::detectCores(), verbose=FALSE)$Y
rownames(emb) <- rownames(pcs)
com <- getComMembership(pcs, k=30, method=igraph::cluster_infomap, verbose=FALSE)
dg <- getDifferentialGenes(cd, com)
plotEmbedding(emb, com, xlab=NA, ylab=NA, mark.clusters=TRUE, alpha=0.1, mark.cluster.cex=0.5, verbose=FALSE)

Plotting all significantly differentially expressed genes in all cells is too messy. Plus, each cluster may have very different numbers of cells such that plotting all cells visually overwhelms small clusters. We will probably want to summarize the expression of genes within clusters. So let’s summarize using the average expression and fraction of cells expressing the gene (ie. non-zero detection) per cluster.
## Summarize gene expression within groups
## average expression
mat.summary <- do.call(cbind, lapply(levels(com), function(ct) {
cells <- which(com==ct)
if(length(cells) > 1) {
Matrix::rowMeans(matnorm[, cells])
} else {
matnorm[,cells]
}
}))
colnames(mat.summary) <- levels(com)
## fraction expressing
fe.summary <- do.call(cbind, lapply(levels(com), function(ct) {
cells <- which(com==ct)
if(length(cells) > 1) {
Matrix::rowMeans(matnorm[, cells]>0)
} else {
matnorm[,cells]>0
}
}))
colnames(fe.summary) <- levels(com)
For the sake of demonstration, we will just select a handful of significantly differentially expressed genes to visualize.
## Order groups
dend <- hclust(dist(t(mat.summary)), method='ward.D')
## Select a few differential genes
dg.sig <- lapply(dg, function(x) {
x <- x[x$Z>1.96,] ## significant z-score
x <- x[x$highest,]
x <- head(na.omit(x))
return(x)
})
names(dg.sig) <- levels(com)
## Get summary matrices for differential genes only
select.groups <- dend$labels[dend$order]
dg.genes <- unlist(lapply(dg.sig[select.groups], rownames))
m <- mat.summary[dg.genes, select.groups]
m <- t(scale(t(m)))
fe <- fe.summary[dg.genes, select.groups]
We could plot just a conventional heatmap of the average gene expression per cluster for all our differentially expressed genes.
library(highcharter)
library(magrittr)
## Visualize as a regular heatmap
hc = hchart(t(m), "heatmap", hcaes(x = group, y = gene, value = value)) %>%
hc_colorAxis(stops = color_stops(10, colorRampPalette(c("steelblue", "white", "red"), space = "Lab")(10))) %>%
hc_title(text = "Differentially Expressed Genes Heatmap") %>%
hc_plotOptions(series = list(dataLabels = list(enabled = FALSE)))
export_hc(filename='heatmap1', hc)
But when exploring our data, maybe we want to encode in additional levels of information such as the fraction of cells expressing the gene.
library(reshape2)
## Visualize fraction expressing as dot size
## reshape for highcharter
mm <- reshape2::melt(m)
mfe <- reshape2::melt(fe)
colnames(mm) <- c('gene', 'group', 'value')
mm <- cbind(mm, fe=mfe$value)
## convert value to colors
colors <- mm$value
colors <- colors - min(colors)
colors <- colors/max(colors)
gradientPalette <- colorRampPalette(c("steelblue", "white", "red"), space = "Lab")(1024)
cols <- gradientPalette[round(colors * (length(gradientPalette) - 1) + 1)]
cols[is.na(cols)] <- '#eeeeee'
mm$color = cols
## convert group and gene names to indices for plotting
mm$x = sapply(mm$gene, function(x) which(dg.genes==x))
mm$y = -sapply(mm$group, function(x) which(select.groups==x))
## visualize as bubble plot
hc = hchart(mm, type="scatter", mapping = hcaes(x=x, y=y, z=fe,
mag=value, color=color,
fe=fe, name=group, gene=gene)) %>%
hc_plotOptions(bubble = list(minSize=1, maxSize=10)) %>%
hc_title(text = 'Heatmap Alternative') %>%
hc_tooltip(headerFormat = "", pointFormat = "<b>group</b>: {point.name} <br> <b>gene</b>: {point.gene} <br> <b>fraction expressing</b>: {point.fe} <br> <b>z-score</b>: {point.mag}") %>%
hc_xAxis(visible=FALSE) %>%
hc_yAxis(visible=FALSE) %>%
hc_plotOptions(series = list(showInLegend=FALSE))
export_hc(filename='heatmap2', hc)
Now we can more easily see which gene may or may not be a better facs marker based on its fractional expression in our cell subpopulation of interest compared to the fractional expression in other subpopulations.
Or maybe we want to encode in AUC metrics, gene expression variance, or other information. However, keep in mind that the purposes of your visualization may be different depending on your goals. If your goal is to more easily explore your data, a more complex visualization with lots of encoded information may be warranted. If your goal is to communicate information, a simpler standard heatmap that’s easier to understand may be sufficient.
Additional Resources
- Older
- Newer
RECENT POSTS
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve on 04 May 2023
- Aligning Spatial Transcriptomics Data With Stalign on 16 April 2023
- 3D animation of the brain in R on 08 November 2022
- Ethical Challenges in Biomedical Engineering - Data Collection, Analysis, and Interpretation on 15 October 2022
- I use R to (try to) figure out the cost of medical procedures by analyzing insurance data from the Transparency in Coverage Final Rule on 12 September 2022
- Annotating STdeconvolve Cell-Types with ASCT+B Tables on 30 August 2022
- Deconvolution vs Clustering Analysis: An exploration via simulation on 11 July 2022
- Coloring SVGs in R on 17 June 2022
- Deconvolution vs Clustering Analysis for Multi-cellular Pixel-Resolution Spatially Resolved Transcriptomics Data on 03 May 2022
- Exploring UMAP parameters in visualizing single-cell spatially resolved transcriptomics data on 19 January 2022
TAGS
RELATED POSTS
- Exploring UMAP parameters in visualizing single-cell spatially resolved transcriptomics data
- Animating RNA velocity with moving arrows
- A tale of two cell populations: integrating RNA velocity information in single cell transcriptomic data visualization with VeloViz
- Story-telling with Data Visualization