Exploring the impact of tSNE parameters interactively
I downloaded ~2000 single cell count matrix from a PBMC dataset (Donor A) from 10X Genomics. I used R and the MUDAN package I'm developing to perform some of the heavy lifting: normalization, PCA dimensionality reduction, and subsetting to get the data into a form that wouldn't kill your browser (important later). I also used a graph-based community detection method to annotation clusters so we can see how different tSNE parameters segregate them later.
library(MUDAN) data(pbmcA) ## filter out poor genes and cells cd <- cleanCounts(pbmcA, min.reads = 10, min.detected = 10, verbose=FALSE) ## CPM normalization mat <- normalizeCounts(cd, verbose=FALSE) ## variance normalize, identify overdispersed genes matnorm.info <- normalizeVariance(mat, details=TRUE, verbose=FALSE) ## log transform matnorm <- log10(matnorm.info$mat+1) ## 30 PCs on overdispersed genes pcs <- getPcs(matnorm[matnorm.info$ods,], nGenes=length(matnorm.info$ods), nPcs=100, verbose=FALSE) ## graph-based community detection; over cluster with small k com <- getComMembership(pcs, k=10, method=igraph::cluster_louvain, verbose=FALSE) ## write out subset of data sub <- 1:500 m <- pcs[sub,] df <- data.frame(group=com[rownames(m)], m) write.csv(df, file='pbmcA.txt', quote=TRUE)
Try it out!
Using your own data
You can either look through the source code of this blog post, or check out: http://jef.works/tsne-online/.
tSNE is a visualization tool. We must be aware of the impact of parameters on our visualizations and not over-interpret clusters that appear coherent in our tSNE embeddings that may not be reflective of actually coherent or stable subpopulations in higher-dimensional space.
(thanks to Fritz Lekschas for sharing)