Feb 10, 2018

Explore more posts
by Tags | by Date | View All
Keep up to date
  • Interactive Visualizations With Highcharter


    Let’s start this blog post with a story. A few years ago, I went to a small meeting focused on single cell computational analyses hosted at Cold Spring Harbor. I was giving a talk on my work and had the chance to participate in a group forum on the future of the field, in particular the computational challenges and opportunities, afterwards. A bioinformatician, Timothy Hu, asked what I thought was a spectacular and fundamental question: why is it so hard to reproduce figures from publications? Especially when, in theory, all the code and data for generating the figure could (and should) be available? I talked to Tim afterwards and we envisioned a future where all figures in electronic journals would be interactive, with all relevant data and code used to generate the figure actually being embedded in the figure itself. We differed on the implementation details of such a vision - Tim wanted a system with lots of server-side operations on the cloud while I believed a system with primarily client-side operations with Javascript will be more sustainable. But back then, the vision seemed like just a dream. How would I even implement such a thing? Where do I even begin?

    Now, a few years later and being much wiser (or at least more skilled programming-wise), it’s time to revisit and reassess the feasibility of this vision by attempting to implement it: Can I create a blog post with all interactive figures that have all data and code used to create the figures embedded in the blog post itself?

    For demonstrative purposes, let’s analyze a PBMC dataset from 10X Genomics (pbmcA). The dataset is available as part of the MUDAN R package I’m developing.

    ##devtools::install_github('JEFworks/MUDAN) ## install
    library(MUDAN)
    data(pbmcA)
    

    We will use highcharter, an excellent R package by Joshua Kunst for creating highcharts.js plots in R, to create plots that we can then export as javascript and embed in this blog post.

    ##install.packages(highcharter)
    library(highcharter)
    

    We can start with some simple filtering and QC visualizations.

    ## Filter out poor quality cells and genes
    cd <- cleanCounts(pbmcA, min.lib.size=1000, max.lib.size=5000, min.read=10, min.detected=5)
    
    ## Look at some QC
    hchart(log10(colSums(cd>0)+1)) %>%
      hc_title(text = "log10(nUMI+1)") %>%
      hc_legend(enabled = FALSE)
    hchart(log10(rowSums(cd)+1)) %>%
      hc_title(text = "log10(nCells+1)") %>%
      hc_legend(enabled = FALSE)
    

    Ok. Our data looks quite reasonable and normally distributed. Now, let’s do our standard single cell analysis and get to a tSNE embedding with detected groups that we want to visualize.

    ## Normalize to CPMs
    mat <- normalizeCounts(cd)
    ## Variance normalize and log transform
    matnorm.info <- normalizeVariance(mat, details=TRUE)
    ## Take overdispersed genes only
    matnorm <- log10(matnorm.info$mat[matnorm.info$ods,]+1)
    ## Get top 30 PCs on all overdispersed genes
    pcs <- getPcs(matnorm, nGenes=nrow(matnorm), nPcs=30)
    ## Graph-based community detection
    com <- getComMembership(pcs, k=30, method=igraph::cluster_infomap)
    ## tSNE embedding
    emb <- Rtsne::Rtsne(t(pcs), is_distance=FALSE, perplexity=30, verbose=TRUE)$Y
    rownames(emb) <- colnames(matnorm)
    
    ## Plot embedding with infomap group labels
    df <- data.frame(x=emb[,1], y=emb[,2], group=com, name=rownames(emb))
    hchart(df, "scatter", hcaes(x, y, group=group, name=name)) %>%
      hc_tooltip(headerFormat = "", pointFormat = "<b>group</b>: {series.name} <br> <b>cell</b>: {point.name}")
    

    Hey, look some clusters! (play with the interactive plot by toggling the visibility of clusters)

    We can perform differential expression analysis among these clusters and visualize significantly differentially expressed genes as a heatmap.

    ## Differential expression analysis
    dg <- getDifferentialGenes(t(cd), com)
    dg.sig <- lapply(dg, function(x) {
      ## get only significantly upregulated
      x <- x[x$Z>3,]
      x <- x[x$highest,]
      rownames(x)
    })
    ## average within groups
    mat.summary <- do.call(cbind, lapply(levels(com), function(ct) {
      cells <- which(com==ct)
      rowMeans(matnorm[, cells])
    }))
    colnames(mat.summary) <- levels(com)
    dim(mat.summary)
    ## cluster groups
    hc <- hclust(dist(t(mat.summary)))
    dg.sig.genes <- na.omit(intersect(unlist(dg.sig[hc$labels[hc$order]]), rownames(matnorm)))
    m <- mat.summary[dg.sig.genes,hc$labels[hc$order]]
    ## scale for better visualization
    m <- t(scale(t(m)))
    ## reshape for highcharter
    mm <- reshape2::melt(m)
    colnames(mm) <- c('gene', 'group', 'value')
    
    ## Visualize as a heatmap
    hchart(m, "heatmap", hcaes(x = gene, y = group, value = value)) %>% 
      hc_colorAxis(stops = color_stops(10, rev(RColorBrewer::brewer.pal(10, "RdBu")))) %>% 
      hc_title(text = "Differentiall Expressd Genes") %>% 
      hc_legend(layout = "vertical", verticalAlign = "top", align = "right", valueDecimals = 0) %>%
      hc_plotOptions(series = list(dataLabels = list(enabled = FALSE)))
    

    Heatmaps are always a bit hard to deal with because the rownames are just so dense. Now with an interactive plot, you can just hover to see the gene name and group assignment!

    We can then dig into the expression of one of these genes and visualize as either a box-plot across our identified groups or in the tSNE embedding.

    ## Boxplot of one diff gene
    gexp <- log10(mat['CD3E', ]+1)
    hcboxplot(x = gexp, var = com,  outliers = FALSE) %>%
      hc_chart(type = "column") %>%
      hc_title(text = "CD3E Expression") 
    
    ## Plot expression in tSNE
    colors <- gexp
    colors <- colors/max(colors)
    gradientPalette <- colorRampPalette(c("gray80", "red"), space = "Lab")(1024)
    cols <- gradientPalette[colors * (length(gradientPalette) - 1) + 1]
    names(cols) <- names(gexp)
    df <- data.frame(x=emb[,1], y=emb[,2], value=gexp, color=cols, name=rownames(emb))
    hchart(df, "scatter", hcaes(x, y, color=color, value=value, name=name)) %>%
      hc_tooltip(headerFormat = "", pointFormat = "<b>cell</b>: {point.name} <br> <b>CD3E exp</b>: {point.value}")
    

    We can go on with downstream characterizations of these subpopulations. But if you right click and view the page source, you will find that all the data used to generate these plots are accessible. It’s not in the prettiest format (all JSONs), but the data is there. And if you wanted to copy all the source code and paste it to another web page, you would recreate all the figures. The figures are somewhat interactive, albeit not linked. But the conclusion is that: Yes! I create a blog post with all interactive figures that have all data and code used to create the figures embedded in the blog post itself.