Apr 6, 2020

# Quickly Creating Pseudobulks from Single Cell Gene Expression Data with Linear Algebra

## Introduction

Single cell gene expression analysis often involves identifying transcriptionally distinct clusters of cells that may represent different cell-types or cell-states. After identifying these clusters of cells, we may want to summarize the gene expression profiles for each cluster by aggregating them into pseudobulks ie. the summed up gene expression profile for all cells in each cluster. I find pseudobulks really helpful in heatmap visualizations to look at top marker or differentially expressed genes, particularly if you have cell-types or cell-states across many patients or conditions. Here, I will compare the runtimes of two different ways to create these pseudobulks that I recently learned from Kamil Slowikowski using simulated data.

## The comparison

To begin, let’s simulate a gene expression matrix with 5,000 genes and 10,000 cells.

``````library(Matrix)
set.seed(0) ## enable reproducibility

## 5000 genes
nrow <- 5000
## 10000 cells
ncol <- 10000
## simulate sparse matrix with ~0.01% of values as 1s and rest as 0s
mat <- matrix(round(rbinom(nrow*ncol, 1, 0.01/100)), nrow, ncol)
dim(mat)
``````
``````## [1]  5000 10000
``````
``````## how many 1s (rest are 0s)
sum(mat)
``````
``````## [1] 4933
``````
``````rownames(mat) <- paste0('gene', 1:nrow)
colnames(mat) <- paste0('cell', 1:ncol)
## take a look
mat[1:5, 1:5]
``````
``````##       cell1 cell2 cell3 cell4 cell5
## gene1     0     0     0     0     0
## gene2     0     0     0     0     0
## gene3     0     0     0     0     0
## gene4     0     0     0     0     0
## gene5     0     0     0     0     0
``````
``````## convert to sparse matrix for faster computing
mat.sparse <- Matrix(mat, sparse = TRUE)
class(mat.sparse)
``````
``````## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
``````
``````## take a look
mat.sparse[1:5, 1:5]
``````
``````## 5 x 5 sparse Matrix of class "dgCMatrix"
##       cell1 cell2 cell3 cell4 cell5
## gene1     .     .     .     .     .
## gene2     .     .     .     .     .
## gene3     .     .     .     .     .
## gene4     .     .     .     .     .
## gene5     .     .     .     .     .
``````

Let’s say these cells fall into 10 different arbitrary cell-types. Let’s simulate a cell-type annotation vector.

``````## make cell-type annotation information
celltype <- as.factor(sample(1:10, ncol, replace=TRUE))
levels(celltype) <- paste0('celltype', 1:10)
names(celltype) <- colnames(mat)
## take a look
``````
``````##      cell1      cell2      cell3      cell4      cell5      cell6
##  celltype8  celltype5  celltype4 celltype10  celltype2  celltype3
## 10 Levels: celltype1 celltype2 celltype3 celltype4 celltype5 ... celltype10
``````
``````## distribution of cells per cell-types
table(celltype)
``````
``````## celltype
##  celltype1  celltype2  celltype3  celltype4  celltype5  celltype6
##       1046        994       1007        998        968       1007
##  celltype7  celltype8  celltype9 celltype10
##       1012        970        988       1010
``````

To create a pseudobulk gene expression profile for each cell-type, we may loop through the cell-types using the `lapply` function and sum up the gene expression values for all cells belonging to each cell-type. We can then column bind these summed up gene expression values using `cbind` and end up with a matrix that has the total expression for our 5,000 gene for each of our 10 cell-types. Note, we will also use `Sys.time()` to keep track of how it takes to run this code for comparative purposes later.

``````## pseudobulk gene expression per cell-type
getPseudobulk <- function(mat, celltype) {
mat.summary <- do.call(cbind, lapply(levels(celltype), function(ct) {
cells <- names(celltype)[celltype==ct]
pseudobulk <- rowSums(mat[, cells])
return(pseudobulk)
}))
colnames(mat.summary) <- levels(celltype)
return(mat.summary)
}

## test runtime
start_time1 <- Sys.time()
## call function
mat.summary <- getPseudobulk(mat.sparse, celltype)
end_time1 <- Sys.time()

## take a look
dim(mat.summary)
``````
``````## [1] 5000   10
``````
``````head(mat.summary)
``````
``````##       celltype1 celltype2 celltype3 celltype4 celltype5 celltype6
## gene1         0         0         0         1         0         0
## gene2         0         0         0         0         0         0
## gene3         0         0         0         0         0         0
## gene4         0         0         1         0         0         0
## gene5         0         1         0         0         0         0
## gene6         0         0         0         0         0         0
##       celltype7 celltype8 celltype9 celltype10
## gene1         0         0         0          0
## gene2         0         1         0          0
## gene3         0         0         0          0
## gene4         0         0         0          0
## gene5         0         0         0          0
## gene6         0         0         0          0
``````

Alternatively, we can use linear algebra to come up with the same pseudobulk gene expression matrix. We can generate a model matrix where each row is a cell and each column is a cell-type. The model matrix entry is 1 if the cell belong to the cell-type, and 0 otherwise.

``````## make model matrix
mm <- model.matrix(~ 0 + celltype)
colnames(mm) <- levels(celltype)
## take alook
dim(mm)
``````
``````## [1] 10000    10
``````
``````head(mm)
``````
``````##   celltype1 celltype2 celltype3 celltype4 celltype5 celltype6 celltype7
## 1         0         0         0         0         0         0         0
## 2         0         0         0         0         1         0         0
## 3         0         0         0         1         0         0         0
## 4         0         0         0         0         0         0         0
## 5         0         1         0         0         0         0         0
## 6         0         0         1         0         0         0         0
##   celltype8 celltype9 celltype10
## 1         1         0          0
## 2         0         0          0
## 3         0         0          0
## 4         0         0          1
## 5         0         0          0
## 6         0         0          0
``````

Now, to come up with the same pseudobulk gene expression matrix, all we need to do is perform a little matrix multiplication! Recall: “row onto column” for multiplying matrices.

``````## test runtime
start_time2 <- Sys.time()
## multiple row of mat.sparse (gene) onto column of the model matrix (cell-type annotation)
mat.summary.mm <- mat.sparse %*% mm
end_time2 <- Sys.time()

## take a look
dim(mat.summary.mm)
``````
``````## [1] 5000   10
``````
``````head(mat.summary.mm)
``````
``````## 6 x 10 Matrix of class "dgeMatrix"
##       celltype1 celltype2 celltype3 celltype4 celltype5 celltype6
## gene1         0         0         0         1         0         0
## gene2         0         0         0         0         0         0
## gene3         0         0         0         0         0         0
## gene4         0         0         1         0         0         0
## gene5         0         1         0         0         0         0
## gene6         0         0         0         0         0         0
##       celltype7 celltype8 celltype9 celltype10
## gene1         0         0         0          0
## gene2         0         1         0          0
## gene3         0         0         0          0
## gene4         0         0         0          0
## gene5         0         0         0          0
## gene6         0         0         0          0
``````

We can check that the results using these two approaches are indeed the same.

``````all.equal(as.matrix(mat.summary), as.matrix(mat.summary.mm))
``````
``````## [1] TRUE
``````

But the matrix multiplication is faster!

``````## cbind, lapply runtime
end_time1-start_time1
``````
``````## Time difference of 0.01023602 secs
``````
``````## matrix multiplication runtime
end_time2-start_time2
``````
``````## Time difference of 0.006520987 secs
``````

Such a model matrix approach can also more easily accomodate additional ways of splitting the cells. For example, consider if our data came from 3 different patients.

``````## more complex model matrices
patient <- as.factor(sample(1:3, ncol, replace=TRUE))
levels(patient) <- paste0('patient', 1:3)
names(patient) <- colnames(mat)
## take a look
``````
``````##    cell1    cell2    cell3    cell4    cell5    cell6
## patient2 patient1 patient2 patient1 patient2 patient1
## Levels: patient1 patient2 patient3
``````
``````## distribution of cells per patient
table(patient)
``````
``````## patient
## patient1 patient2 patient3
##     3270     3324     3406
``````

Perhaps we want to look at the pseudobulk gene expression profile for each cell-type within each patient. To do this, we can just add to our model matrix. Again each row is a cell but now each column is a combination of patient and cell-type. The model matrix entry is 1 if the cell came from the patient and is of the cell-type, and 0 otherwise.

``````mm2 <- model.matrix(~ 0 + patient:celltype)
dim(mm2)
``````
``````## [1] 10000    30
``````
``````mm2[1:10,1:4]
``````
``````##    patientpatient1:celltypecelltype1 patientpatient2:celltypecelltype1
## 1                                  0                                 0
## 2                                  0                                 0
## 3                                  0                                 0
## 4                                  0                                 0
## 5                                  0                                 0
## 6                                  0                                 0
## 7                                  0                                 0
## 8                                  0                                 1
## 9                                  0                                 0
## 10                                 0                                 0
##    patientpatient3:celltypecelltype1 patientpatient1:celltypecelltype2
## 1                                  0                                 0
## 2                                  0                                 0
## 3                                  0                                 0
## 4                                  0                                 0
## 5                                  0                                 0
## 6                                  0                                 0
## 7                                  0                                 0
## 8                                  0                                 0
## 9                                  0                                 0
## 10                                 0                                 0
``````

Now we end up with a pseudobulk gene expression matrix that has the total expression for our 5,000 gene for each of our 10 cell-types within each of our 3 patients.

``````mat.summary.mm2 <- mat.sparse %*% mm2
dim(mat.summary.mm2)
``````
``````## [1] 5000   30
``````
``````mat.summary.mm2[1:10,1:4]
``````
``````## 10 x 4 Matrix of class "dgeMatrix"
##        patientpatient1:celltypecelltype1 patientpatient2:celltypecelltype1
## gene1                                  0                                 0
## gene2                                  0                                 0
## gene3                                  0                                 0
## gene4                                  0                                 0
## gene5                                  0                                 0
## gene6                                  0                                 0
## gene7                                  0                                 0
## gene8                                  0                                 0
## gene9                                  0                                 1
## gene10                                 0                                 0
##        patientpatient3:celltypecelltype1 patientpatient1:celltypecelltype2
## gene1                                  0                                 0
## gene2                                  0                                 0
## gene3                                  0                                 0
## gene4                                  0                                 0
## gene5                                  0                                 0
## gene6                                  0                                 0
## gene7                                  0                                 0
## gene8                                  0                                 0
## gene9                                  0                                 0
## gene10                                 0                                 0
``````

## Caveats

However, note that matrix multiplication is only faster in this setting since our gene expression matrix is sparse.

## Try it out for yourself!

• What happens if we use a regular dense expression matrix instead of a sparse matrix?
• Does the improvement in speed depend on the number of genes or number of cells?