Interactive tSNE

Exploring the impact of tSNE parameters interactively

t-SNE is a popular dimensionality reduction method used for, among many other things, visualizing transcriptional subpopulations in single-cell RNA-seq data. However, appropriate parameter choices are often unclear, and poor choices may result in misleading embeddings. In particular, the number of features (either genes or PCs), the number of effective neighbors (perplexity), and the distance metric all affect the resulting 2D tSNE embedding. Building on previously developed Javascript-based implementations of the t-SNE algorithm, I present in this blog post a means to interactively explore the impact of these parameters on the visualization of identified subpopulations.

The Data

I downloaded a count matrix of ~2000 single cells from a PBMC dataset (Donor A) from 10X Genomics. I used R and the MUDAN package I'm developing to do the heavy lifting: normalization, PCA dimensionality reduction, and subsetting to get the data into a form that wouldn't kill your browser (important later). I also used a graph-based community detection method to annotate clusters so we can see how different tSNE parameters segregate them later.

library(MUDAN)
data(pbmcA)
## filter out poor genes and cells
cd <- cleanCounts(pbmcA,
                  min.reads = 10,
                  min.detected = 10,
                  verbose=FALSE)
## CPM normalization
mat <- normalizeCounts(cd,
                       verbose=FALSE)
## variance normalize, identify overdispersed genes
matnorm.info <- normalizeVariance(mat,
                                  details=TRUE,
                                  verbose=FALSE)
## log transform
matnorm <- log10(matnorm.info$mat+1)
## 100 PCs on overdispersed genes
pcs <- getPcs(matnorm[matnorm.info$ods,],
              nGenes=length(matnorm.info$ods),
              nPcs=100,
              verbose=FALSE)

## graph-based community detection; over cluster with small k
com <- getComMembership(pcs,
                        k=10, method=igraph::cluster_louvain,
                        verbose=FALSE)

## write out subset of data
sub <- 1:500
m <- pcs[sub,]
df <- data.frame(group=com[rownames(m)], m)
write.csv(df, file='pbmcA.txt', quote=TRUE)

The result is a 500x100 matrix of numbers (500 cells, 100 PCs). To enable interactive exploration, all downstream computation, including computing distances on the PCs, computing the tSNE embedding, and visualizing that embedding, is done in Javascript. As a client-side language that runs in the browser, Javascript is less suited to heavy numerical work than R, but it still does a reasonable job and, of course, lets the user click and explore interactively, as in any web app, without any programming. In fact, the first 100 PCs from 500 single cells are embedded directly in this page. (Right click and view the page source to check it out for yourself!) Storing even a 500x100 matrix in memory using untyped Javascript arrays does seem to push the limits of my laptop: my browser crashed when I tried loading the full dataset of ~2000 single cells.
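To make "computing distances on the PCs" concrete, here is a minimal sketch of the idea. It is not the code embedded in this page: the helper names (`toFlatMatrix`, `metrics`, `pairwiseDistances`) are hypothetical, and Euclidean and Manhattan are just two example metrics. The sketch packs the matrix into a single typed array and uses only the first `nPcs` columns when measuring the distance between two cells.

```javascript
// Hypothetical helpers for illustration; not taken from the page source.

// Pack an array of rows into one flat, row-major Float64Array.
function toFlatMatrix(rows) {
  const nRows = rows.length;
  const nCols = nRows > 0 ? rows[0].length : 0;
  const data = new Float64Array(nRows * nCols);
  for (let i = 0; i < nRows; i++) {
    for (let j = 0; j < nCols; j++) {
      data[i * nCols + j] = rows[i][j];
    }
  }
  return { data, nRows, nCols };
}

// Example distance metrics computed over the first nPcs columns of rows i and j.
const metrics = {
  euclidean(m, i, j, nPcs) {
    let sum = 0;
    for (let k = 0; k < nPcs; k++) {
      const d = m.data[i * m.nCols + k] - m.data[j * m.nCols + k];
      sum += d * d;
    }
    return Math.sqrt(sum);
  },
  manhattan(m, i, j, nPcs) {
    let sum = 0;
    for (let k = 0; k < nPcs; k++) {
      sum += Math.abs(m.data[i * m.nCols + k] - m.data[j * m.nCols + k]);
    }
    return sum;
  },
};

// Symmetric n x n distance matrix, stored flat; the diagonal stays 0.
function pairwiseDistances(m, nPcs, metricName) {
  const metric = metrics[metricName];
  if (!metric) throw new Error('unknown metric: ' + metricName);
  const k = Math.min(nPcs, m.nCols);
  const n = m.nRows;
  const dist = new Float64Array(n * n);
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      const d = metric(m, i, j, k);
      dist[i * n + j] = d;
      dist[j * n + i] = d;
    }
  }
  return dist;
}

// Toy example: 3 "cells" with 3 "PCs", using only the first 2 PCs.
const toy = toFlatMatrix([[0, 0, 5], [3, 4, 5], [1, 1, -5]]);
const d2 = pairwiseDistances(toy, 2, 'euclidean');
console.log(d2[0 * 3 + 1]); // 5
```

Changing `nPcs` or `metricName` changes the distance matrix that tSNE starts from, which is why those two settings can reshape the final embedding.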

Anyway, because the tSNE computation and visualization are done in Javascript, we can interactively assess how tSNE parameters impact the final 2D visualization without relying on external servers or installations. Click the run button and try it out for yourself! What happens when we set perplexity to something really small? Or really big? What happens if we use very few PCs? A lot of PCs? Does the distance metric matter?
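For intuition about what the perplexity setting controls, here is a minimal sketch of the standard t-SNE definition: each cell gets a Gaussian-weighted probability distribution over its neighbors, and perplexity is 2 raised to the entropy of that distribution, so it behaves like an effective neighbor count. The function names are hypothetical, and the simple bisection search is an assumption for illustration, not necessarily how the implementation used on this page does it.

```javascript
// Hypothetical helpers for illustration; not taken from the page source.

// Perplexity of one point's neighbor distribution, given its squared
// distances to all OTHER points and a precision beta = 1 / (2 * sigma^2).
function perplexityForBeta(sqDists, beta) {
  // Shift by the minimum distance so the exponentials do not all underflow.
  let minD = Infinity;
  for (let i = 0; i < sqDists.length; i++) {
    if (sqDists[i] < minD) minD = sqDists[i];
  }
  const weights = new Float64Array(sqDists.length);
  let total = 0;
  for (let i = 0; i < sqDists.length; i++) {
    weights[i] = Math.exp(-beta * (sqDists[i] - minD));
    total += weights[i];
  }
  // Shannon entropy (in bits) of the normalized weights.
  let entropy = 0;
  for (let i = 0; i < weights.length; i++) {
    const p = weights[i] / total;
    if (p > 0) entropy -= p * Math.log2(p);
  }
  return Math.pow(2, entropy);
}

// Bisection search for the beta whose perplexity matches the target.
// Perplexity shrinks as beta grows (a narrower Gaussian means fewer neighbors).
// The target should lie between 1 and the number of other points.
function betaForPerplexity(sqDists, target, tol = 1e-5, maxTries = 100) {
  let lo = 0;
  let hi = Infinity;
  let beta = 1;
  for (let t = 0; t < maxTries; t++) {
    const perp = perplexityForBeta(sqDists, beta);
    if (Math.abs(perp - target) < tol) break;
    if (perp > target) {
      // Too many effective neighbors: narrow the Gaussian.
      lo = beta;
      beta = hi === Infinity ? beta * 2 : (beta + hi) / 2;
    } else {
      // Too few effective neighbors: widen the Gaussian.
      hi = beta;
      beta = (beta + lo) / 2;
    }
  }
  return beta;
}

// Four equally distant neighbors give exactly four effective neighbors.
console.log(perplexityForBeta([1, 1, 1, 1], 0.5)); // 4

// Neighbors at increasing distance: find the beta for ~2 effective neighbors.
const sqExample = [1, 4, 9, 16, 25];
const betaExample = betaForPerplexity(sqExample, 2);
console.log(betaExample, perplexityForBeta(sqExample, betaExample));
```

A very small perplexity makes each cell attend to only its closest neighbor or two, while a perplexity close to the number of cells makes every cell weight all others almost equally.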

Try it out!

[Interactive widget with the following controls and initial values: Number of PCs (30), Perplexity (30), Learning Rate (10), Max Iterations (500), Distance Metric]

Using your own data

You can either look through the source code of this blog post, or check out: http://jef.works/tsne-online/.
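If you want to start from a file like the `pbmcA.txt` written by the R code above, the sketch below shows one way to read it in Javascript. It is an assumption-laden illustration, not the loader used by the linked tool: `parseEmbeddingCsv` is a hypothetical name, the layout assumed is what `write.csv(df, quote=TRUE)` produces (row names first, then `group`, then numeric columns), and the parser assumes no field contains a comma. The column names `PC1` and `PC2` in the example are placeholders.

```javascript
// Hypothetical helper for illustration; not the linked tool's loader.
// Assumed layout: column 1 = cell name, column 2 = group label,
// remaining columns = numeric features (e.g. PCs).
function parseEmbeddingCsv(text) {
  // Strip surrounding double quotes, as written by R with quote=TRUE.
  const unquote = (s) => s.trim().replace(/^"(.*)"$/, '$1');
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== '');
  const header = lines[0].split(',').map(unquote);
  const cells = [];
  const groups = [];
  const rows = [];
  for (const line of lines.slice(1)) {
    const fields = line.split(',').map(unquote);
    cells.push(fields[0]);
    groups.push(fields[1]);
    rows.push(fields.slice(2).map(Number));
  }
  return { features: header.slice(2), cells, groups, rows };
}

// Tiny example in the assumed layout (column names are placeholders).
const exampleCsv =
  '"","group","PC1","PC2"\n' +
  '"cellA","1",0.5,-1.25\n' +
  '"cellB","2",2,3e-1\n';
const parsed = parseEmbeddingCsv(exampleCsv);
console.log(parsed.cells);  // [ 'cellA', 'cellB' ]
console.log(parsed.groups); // [ '1', '2' ]
console.log(parsed.rows);   // [ [ 0.5, -1.25 ], [ 2, 0.3 ] ]
```

Keeping the group labels separate from the numeric rows makes it easy to color points by cluster after the embedding is computed.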

Concluding thoughts

tSNE is a visualization tool. We must be aware of how parameters affect our visualizations and avoid over-interpreting clusters that appear coherent in a tSNE embedding, since they may not reflect coherent or stable subpopulations in the higher-dimensional space.

Additional resources

(thanks to Fritz Lekschas for sharing)