Aug 25, 2020

Explore more posts
by Tags | by Date | View All

Keep up to date

Using scVelo in R using Reticulate

Introduction

Although scRNA-seq has enabled us to characterize transcriptomes at single cell resolution to identify transcriptionally distinct cell-types and cell-states, it is a destructive protocol and thus only allows us to capture a static snapshot in time. As we are often interested in how cells change during dynamic processes such as differentiation, tumor development, drug response, etc, we would like to infer some information regarding directed temporal dynamics from this static snapshot. RNA velocity analysis is a computational approach that allows us to computationally predict where the cell is “heading” in terms of its gene expression (e.g. predict the future transcriptional state of cells). Details of RNA velocity analysis and modeling along with accompany software as part of the velocyto.R R package can be found in the original publication by La Manno et al., (2018). Recently, RNA velocity analysis was expanded by Bergen et al (2020) to further enable inference of gene-specific rates of transcription, splicing and degregation, accomodate transient cell-states, among other cool features and has been made available as part of the scvelo Python package.

I am personally much more familiar with R programming and generally prefer to stay within one programming language for reproducibility purposes. So rather than switching to Python to use scvelo, in this tutorial, I will demo the use scvelo from within R using R’s reticulate package.

Setting up

First, we will need to install reticulate. It also helps to have Conda (https://docs.conda.io/en/latest/) installed to manage Python.

# install.packages("reticulate")
library(reticulate)

I previously created a Conda environment called r-velocity and installed scvelo via:

# bash
pip install -U scvelo

You can double check in Python that the install worked using:

# bash
conda activate r-velocity
python
>>> import scvelo as scv

Now we can load up the appropriate Conda environment that has scvelo installed.

conda_list()

##           name                                      python
## 1    anaconda3                   /opt/anaconda3/bin/python
## 2 r-reticulate /opt/anaconda3/envs/r-reticulate/bin/python
## 3   r-velocity   /opt/anaconda3/envs/r-velocity/bin/python

use_condaenv("r-velocity", required = TRUE)
scv <- import("scvelo")
scv$logging$print_version()

## Running scvelo 0.2.2 (python 3.8.5) on 2020-08-25 20:02.

scvelo

For demo purposes, we will work through the following tutorial (https://scvelo.readthedocs.io/) and work through how to interface with the results in R. We will download and use the built-in pancreas dataset.

adata <- scv$datasets$pancreas()
adata

## AnnData object with n_obs × n_vars = 3696 × 27998
##     obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
##     var: 'highly_variable_genes'
##     uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
##     obsm: 'X_pca', 'X_umap'
##     layers: 'spliced', 'unspliced'
##     obsp: 'distances', 'connectivities'

A UMAP embedding has already been generated and is accessible in the pancreas dataset’s AnnData object. The same goes for cluster annotations. We can plot results using scvelo’s plotting functions. Note, this will result in a pop-up window.

scv$pl$scatter(adata, legend_loc='lower left', size=60)

We can also access these elements and plot them within R.

## get embedding
emb <- adata$obsm['X_umap']
clusters <- adata$obs$clusters
rownames(emb) <- names(clusters) <- adata$obs_names$values

## get clusters, convert to colors
col <- rainbow(length(levels(clusters)), s=0.8, v=0.8)
cell.cols <- col[clusters]
names(cell.cols) <- names(clusters)

## simple plot
plot(emb, col=cell.cols, pch=16,
     xlab='UMAP X', ylab='UMAP Y')
legend(x=-13, y=0, 
       legend=levels(clusters),
       col=col, 
       pch=16)

Now, let’s actually run scvelo’s dynamic RNA velocity modeling on this pancreas data from within R! Note, that the main difference in terms of R function calls compared to python is now . are $. Other than that, passing objects through py_to_r or r_to_py can fix a lot of errors.

## run scvelo dynamic model
scv$pp$filter_genes(adata) ## filter
scv$pp$moments(adata) ## normalize and compute moments
scv$tl$recover_dynamics(adata) ## model

## takes awhile, so uncomment to save
#adata$write('data/pancreas.h5ad', compression='gzip')
#adata = scv$read('data/pancreas.h5ad')

We can visualize the dynamic RNA velocities on the UMAP embedding using scvelo’s plotting functions. Again, this will result in a pop-up window.

## plot (creates pop up window)
scv$tl$velocity(adata, mode='dynamical')
scv$tl$velocity_graph(adata)
scv$pl$velocity_embedding_stream(adata, basis='umap')
## scv$pl$velocity_embedding_stream(adata, basis='pca') ## other embedding

We can also pull out the top genes driving the dynamic RNA velocities.

## top dynamic genes
topgenes <- adata$var["fit_likelihood"]
topgenes_vals <- topgenes[,1]
names(topgenes_vals) <- rownames(topgenes)
topgenes_vals <- sort(topgenes_vals, decreasing=TRUE)
head(topgenes_vals)

##     Pcsk2     Top2a      Rps3     Gng12      Pak3       Ank 
## 0.7811821 0.6639022 0.6355273 0.6131085 0.5956294 0.5955843

We can further visualize their phase diagrams using scvelo’s plotting functions.

scv$pl$scatter(adata, basis=names(topgenes_vals)[1:5], ncols=5, frameon=FALSE)

velocyto.R

By working within R, we can also easily compare results with the original, non-dynamic RNA velocity modeling with velocyto.R.

library(velocyto.R)

## Loading required package: Matrix

fit.quantile <- 0.1

## pull out spliced and unspliced matrices from AnnData
emat <- as.matrix(t(adata$layers['spliced']))
nmat <- as.matrix(t(adata$layers['unspliced']))
cells <- adata$obs_names$values
genes <- adata$var_names$values
colnames(emat) <- colnames(nmat) <- cells
rownames(emat) <- rownames(nmat) <- genes

## pull out PCA 
pcs <- adata$obsm['X_pca']
rownames(pcs) <- cells
cell.dist <- as.dist(1-cor(t(pcs))) ## cell distance in PC space

## filter genes
gexp1 <- log10(rowSums(emat)+1)
gexp2 <- log10(rowSums(nmat)+1)
#plot(gexp1, gexp2)
good.genes <- genes[gexp1 > 2 & gexp2 > 1]

Now we can run the RNA velocity model using velocyto.R.

## velocyto model
rvel.cd <- gene.relative.velocity.estimates(emat[good.genes,], nmat[good.genes,],
                                            deltaT=1, kCells=30,
                                            cell.dist=cell.dist,
                                            fit.quantile=fit.quantile, 
                                            mult=100)
## takes awhile, so uncomment to save
## save(rvel.cd, file="data/velocyto.RData")

We can visualize the RNA velocities on the original UMAP embedding using velocyto.R’s plotting functions.

## Plot on embedding
show.velocity.on.embedding.cor(scale(emb), rvel.cd, 
                               n = 100,
                               scale='sqrt',
                               cex=1, arrow.scale=1, show.grid.flow=TRUE,
                               min.grid.cell.mass=0.5, grid.n=30, arrow.lwd=2,
                               cell.colors=cell.cols)

We can also visualize phase diagrams for the top dynamic genes from scvelo using velocyto.R’s plotting functions.

## Gene plot
sapply(1:5, function(i) {
gene.relative.velocity.estimates(emat[good.genes,], nmat[good.genes,],
                                 kCells = 30,
                                 fit.quantile = fit.quantile,
                                 old.fit = rvel.cd,
                                 show.gene = names(topgenes_vals)[i],
                                 cell.emb = emb,
                                 cell.colors = cell.cols)
})

## calculating convolved matrices ... done

## calculating convolved matrices ... done

## calculating convolved matrices ... done

## calculating convolved matrices ... done

## calculating convolved matrices ... done

Additional tips

Alternatively, if we wanted to use our own data, we can create an AnnData object such as follows. We can then use scvelo to run analyses. If you are more comfortable in R like me, a lot of filtering, clustering, and generating embeddings can be made within R and put into the AnnData object such that scvelo is only used for the dynamic RNA velocity component.

ad <- import("anndata", convert = FALSE)
dfobs <- data.frame(clusters, annotations)
rownames(dfobs) <- cells
dfvar <- data.frame(genes_attributes)
rownames(dfvar) <- genes
adata <- ad$AnnData(
  X=t(expression_matrix),
  obs=dfobs,
  var=dfvar,
  layers=list('spliced'=t(emat), 'unspliced'=t(nmat)),
  obsm=list('X_tsne'=emb, 'X_pca'=pcs$x[,1:2]) 
  )

Try it out for yourself!

Session info

sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] velocyto.R_0.6  Matrix_1.2-18   reticulate_1.16
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5          knitr_1.29          cluster_2.1.0      
##  [4] magrittr_1.5        BiocGenerics_0.34.0 splines_4.0.2      
##  [7] MASS_7.3-51.6       lattice_0.20-41     rlang_0.4.7        
## [10] stringr_1.4.0       tools_4.0.2         parallel_4.0.2     
## [13] grid_4.0.2          Biobase_2.48.0      nlme_3.1-148       
## [16] mgcv_1.8-31         xfun_0.16           htmltools_0.5.0    
## [19] yaml_2.2.1          digest_0.6.25       evaluate_0.14      
## [22] rmarkdown_2.3       stringi_1.4.6       compiler_4.0.2     
## [25] pcaMethods_1.80.0   jsonlite_1.7.0