Apr 6, 2020

Explore more posts
by Tags | by Date | View All
Keep up to date
  • Quickly Creating Pseudobulks


    Quickly Creating Pseudobulks from Single Cell Gene Expression Data with Linear Algebra

    Introduction

    Single cell gene expression analysis often involves identifying transcriptionally distinct clusters of cells that may represent different cell-types or cell-states. After identifying these clusters of cells, we may want to summarize the gene expression profiles for each cluster by aggregating them into pseudobulks ie. the summed up gene expression profile for all cells in each cluster. I find pseudobulks really helpful in heatmap visualizations to look at top marker or differentially expressed genes, particularly if you have cell-types or cell-states across many patients or conditions. Here, I will compare the runtimes of two different ways to create these pseudobulks that I recently learned from Kamil Slowikowski using simulated data.

    The comparison

    To begin, let’s simulate a gene expression matrix with 5,000 genes and 10,000 cells.

    library(Matrix)
    set.seed(0) ## enable reproducibility
    
    ## 5000 genes
    nrow <- 5000
    ## 10000 cells
    ncol <- 10000
    ## simulate sparse matrix with ~0.01% of values as 1s and rest as 0s
    mat <- matrix(round(rbinom(nrow*ncol, 1, 0.01/100)), nrow, ncol)
    dim(mat)
    
    ## [1]  5000 10000
    
    ## how many 1s (rest are 0s)
    sum(mat)
    
    ## [1] 4933
    
    rownames(mat) <- paste0('gene', 1:nrow)
    colnames(mat) <- paste0('cell', 1:ncol)
    ## take a look
    mat[1:5, 1:5]
    
    ##       cell1 cell2 cell3 cell4 cell5
    ## gene1     0     0     0     0     0
    ## gene2     0     0     0     0     0
    ## gene3     0     0     0     0     0
    ## gene4     0     0     0     0     0
    ## gene5     0     0     0     0     0
    
    ## convert to sparse matrix for faster computing
    mat.sparse <- Matrix(mat, sparse = TRUE)
    class(mat.sparse)
    
    ## [1] "dgCMatrix"
    ## attr(,"package")
    ## [1] "Matrix"
    
    ## take a look
    mat.sparse[1:5, 1:5]
    
    ## 5 x 5 sparse Matrix of class "dgCMatrix"
    ##       cell1 cell2 cell3 cell4 cell5
    ## gene1     .     .     .     .     .
    ## gene2     .     .     .     .     .
    ## gene3     .     .     .     .     .
    ## gene4     .     .     .     .     .
    ## gene5     .     .     .     .     .
    

    Let’s say these cells fall into 10 different arbitrary cell-types. Let’s simulate a cell-type annotation vector.

    ## make cell-type annotation information
    celltype <- as.factor(sample(1:10, ncol, replace=TRUE))
    levels(celltype) <- paste0('celltype', 1:10)
    names(celltype) <- colnames(mat)
    ## take a look
    head(celltype)
    
    ##      cell1      cell2      cell3      cell4      cell5      cell6 
    ##  celltype8  celltype5  celltype4 celltype10  celltype2  celltype3 
    ## 10 Levels: celltype1 celltype2 celltype3 celltype4 celltype5 ... celltype10
    
    ## distribution of cells per cell-types
    table(celltype)
    
    ## celltype
    ##  celltype1  celltype2  celltype3  celltype4  celltype5  celltype6 
    ##       1046        994       1007        998        968       1007 
    ##  celltype7  celltype8  celltype9 celltype10 
    ##       1012        970        988       1010
    

    To create a pseudobulk gene expression profile for each cell-type, we may loop through the cell-types using the lapply function and sum up the gene expression values for all cells belonging to each cell-type. We can then column bind these summed up gene expression values using cbind and end up with a matrix that has the total expression for our 5,000 gene for each of our 10 cell-types. Note, we will also use Sys.time() to keep track of how it takes to run this code for comparative purposes later.

    ## pseudobulk gene expression per cell-type
    getPseudobulk <- function(mat, celltype) {
       mat.summary <- do.call(cbind, lapply(levels(celltype), function(ct) {
         cells <- names(celltype)[celltype==ct]
         pseudobulk <- rowSums(mat[, cells])
         return(pseudobulk)
       }))
       colnames(mat.summary) <- levels(celltype)
       return(mat.summary)
    }
    
    ## test runtime
    start_time1 <- Sys.time()
    ## call function
    mat.summary <- getPseudobulk(mat.sparse, celltype)
    end_time1 <- Sys.time()
    
    ## take a look
    dim(mat.summary)
    
    ## [1] 5000   10
    
    head(mat.summary)
    
    ##       celltype1 celltype2 celltype3 celltype4 celltype5 celltype6
    ## gene1         0         0         0         1         0         0
    ## gene2         0         0         0         0         0         0
    ## gene3         0         0         0         0         0         0
    ## gene4         0         0         1         0         0         0
    ## gene5         0         1         0         0         0         0
    ## gene6         0         0         0         0         0         0
    ##       celltype7 celltype8 celltype9 celltype10
    ## gene1         0         0         0          0
    ## gene2         0         1         0          0
    ## gene3         0         0         0          0
    ## gene4         0         0         0          0
    ## gene5         0         0         0          0
    ## gene6         0         0         0          0
    

    Alternatively, we can use linear algebra to come up with the same pseudobulk gene expression matrix. We can generate a model matrix where each row is a cell and each column is a cell-type. The model matrix entry is 1 if the cell belong to the cell-type, and 0 otherwise.

    ## make model matrix
    mm <- model.matrix(~ 0 + celltype)
    colnames(mm) <- levels(celltype)
    ## take alook
    dim(mm)
    
    ## [1] 10000    10
    
    head(mm)
    
    ##   celltype1 celltype2 celltype3 celltype4 celltype5 celltype6 celltype7
    ## 1         0         0         0         0         0         0         0
    ## 2         0         0         0         0         1         0         0
    ## 3         0         0         0         1         0         0         0
    ## 4         0         0         0         0         0         0         0
    ## 5         0         1         0         0         0         0         0
    ## 6         0         0         1         0         0         0         0
    ##   celltype8 celltype9 celltype10
    ## 1         1         0          0
    ## 2         0         0          0
    ## 3         0         0          0
    ## 4         0         0          1
    ## 5         0         0          0
    ## 6         0         0          0
    

    Now, to come up with the same pseudobulk gene expression matrix, all we need to do is perform a little matrix multiplication! Recall: “row onto column” for multiplying matrices.

    ## test runtime
    start_time2 <- Sys.time()
    ## multiple row of mat.sparse (gene) onto column of the model matrix (cell-type annotation)
    mat.summary.mm <- mat.sparse %*% mm
    end_time2 <- Sys.time()
    
    ## take a look
    dim(mat.summary.mm)
    
    ## [1] 5000   10
    
    head(mat.summary.mm)
    
    ## 6 x 10 Matrix of class "dgeMatrix"
    ##       celltype1 celltype2 celltype3 celltype4 celltype5 celltype6
    ## gene1         0         0         0         1         0         0
    ## gene2         0         0         0         0         0         0
    ## gene3         0         0         0         0         0         0
    ## gene4         0         0         1         0         0         0
    ## gene5         0         1         0         0         0         0
    ## gene6         0         0         0         0         0         0
    ##       celltype7 celltype8 celltype9 celltype10
    ## gene1         0         0         0          0
    ## gene2         0         1         0          0
    ## gene3         0         0         0          0
    ## gene4         0         0         0          0
    ## gene5         0         0         0          0
    ## gene6         0         0         0          0
    

    We can check that the results using these two approaches are indeed the same.

    all.equal(as.matrix(mat.summary), as.matrix(mat.summary.mm))
    
    ## [1] TRUE
    

    But the matrix multiplication is faster!

    ## cbind, lapply runtime
    end_time1-start_time1
    
    ## Time difference of 0.01023602 secs
    
    ## matrix multiplication runtime
    end_time2-start_time2
    
    ## Time difference of 0.006520987 secs
    

    Additional Tips

    Such a model matrix approach can also more easily accomodate additional ways of splitting the cells. For example, consider if our data came from 3 different patients.

    ## more complex model matrices
    patient <- as.factor(sample(1:3, ncol, replace=TRUE))
    levels(patient) <- paste0('patient', 1:3)
    names(patient) <- colnames(mat)
    ## take a look
    head(patient)
    
    ##    cell1    cell2    cell3    cell4    cell5    cell6 
    ## patient2 patient1 patient2 patient1 patient2 patient1 
    ## Levels: patient1 patient2 patient3
    
    ## distribution of cells per patient
    table(patient)
    
    ## patient
    ## patient1 patient2 patient3 
    ##     3270     3324     3406
    

    Perhaps we want to look at the pseudobulk gene expression profile for each cell-type within each patient. To do this, we can just add to our model matrix. Again each row is a cell but now each column is a combination of patient and cell-type. The model matrix entry is 1 if the cell came from the patient and is of the cell-type, and 0 otherwise.

    mm2 <- model.matrix(~ 0 + patient:celltype)
    dim(mm2)
    
    ## [1] 10000    30
    
    mm2[1:10,1:4]
    
    ##    patientpatient1:celltypecelltype1 patientpatient2:celltypecelltype1
    ## 1                                  0                                 0
    ## 2                                  0                                 0
    ## 3                                  0                                 0
    ## 4                                  0                                 0
    ## 5                                  0                                 0
    ## 6                                  0                                 0
    ## 7                                  0                                 0
    ## 8                                  0                                 1
    ## 9                                  0                                 0
    ## 10                                 0                                 0
    ##    patientpatient3:celltypecelltype1 patientpatient1:celltypecelltype2
    ## 1                                  0                                 0
    ## 2                                  0                                 0
    ## 3                                  0                                 0
    ## 4                                  0                                 0
    ## 5                                  0                                 0
    ## 6                                  0                                 0
    ## 7                                  0                                 0
    ## 8                                  0                                 0
    ## 9                                  0                                 0
    ## 10                                 0                                 0
    

    Now we end up with a pseudobulk gene expression matrix that has the total expression for our 5,000 gene for each of our 10 cell-types within each of our 3 patients.

    mat.summary.mm2 <- mat.sparse %*% mm2
    dim(mat.summary.mm2)
    
    ## [1] 5000   30
    
    mat.summary.mm2[1:10,1:4]
    
    ## 10 x 4 Matrix of class "dgeMatrix"
    ##        patientpatient1:celltypecelltype1 patientpatient2:celltypecelltype1
    ## gene1                                  0                                 0
    ## gene2                                  0                                 0
    ## gene3                                  0                                 0
    ## gene4                                  0                                 0
    ## gene5                                  0                                 0
    ## gene6                                  0                                 0
    ## gene7                                  0                                 0
    ## gene8                                  0                                 0
    ## gene9                                  0                                 1
    ## gene10                                 0                                 0
    ##        patientpatient3:celltypecelltype1 patientpatient1:celltypecelltype2
    ## gene1                                  0                                 0
    ## gene2                                  0                                 0
    ## gene3                                  0                                 0
    ## gene4                                  0                                 0
    ## gene5                                  0                                 0
    ## gene6                                  0                                 0
    ## gene7                                  0                                 0
    ## gene8                                  0                                 0
    ## gene9                                  0                                 0
    ## gene10                                 0                                 0
    

    Caveats

    However, note that matrix multiplication is only faster in this setting since our gene expression matrix is sparse.

    Try it out for yourself!

    • What happens if we use a regular dense expression matrix instead of a sparse matrix?
    • Does the improvement in speed depend on the number of genes or number of cells?

    Additional reading