Jan 5, 2018
Illustrating the impact of Simpson’s Paradox and the need for single-cell measurements
Simpson’s paradox is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is often cited as a reason for single cell measurements. Here, we illustrate Simpson’s paradox with an explicit hypothetical example relevant to single cell RNA-seq.
Consider a tumor sample pre- and post- drug treatment.
The tumor consists of 3 distinct subpopulations A, B, and C (perhaps they’re genetically distinct with different driver mutations). Prior to treatment, the dominant subpopulation was C, comprising 80% of the tumor, while subpopulations B and A comprised 16% and 4% of the tumor respectively. After treatment, perhaps in response to selective pressures imposed by the treatment, subpopulation C has shrunk while subpopulation A has grown to become the new dominant subpopulation. Now subpopulation C comprises only 4% of the tumor while subpopulation A comprises 80%. The proportion of subpopulation B is unchanged at 16%.
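A summary of the information thus far (proportion of the tumor):

|Subpopulation|Pre-treatment|Post-treatment|
|---|---|---|
|A|4%|80%|
|B|16%|16%|
|C|80%|4%|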
Consider that we are interested in the impact of the drug treatment on a particular gene, Gene X. Gene X may be a target within the pathway affected by the drug treatment.
Prior to treatment, subpopulation A expressed Gene X at an average log CPM of 0.1, while subpopulation B expressed this gene at a higher level of 1.5 and subpopulation C at 3.0. Using bulk RNA sequencing, which does not allow us to distinguish among these subpopulations, we would observe an average expression of Gene X at 0.04*0.1+0.16*1.5+0.80*3=2.64. As expected, the gene is fairly highly expressed on average in the tumor, since it is highly expressed in subpopulation C, which is the dominant subpopulation in the tumor.
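The bulk value above is just a proportion-weighted mean. A minimal Python sketch of that arithmetic, using only the numbers from this example (the helper name `bulk_average` is made up for illustration):

```python
# Subpopulation proportions and mean Gene X expression (log CPM)
# before treatment, copied from the example in the text.
pre_proportions = {"A": 0.04, "B": 0.16, "C": 0.80}
pre_expression = {"A": 0.1, "B": 1.5, "C": 3.0}


def bulk_average(proportions, expression):
    """Proportion-weighted mean, i.e. what a bulk assay would observe."""
    return sum(proportions[k] * expression[k] for k in proportions)


pre_bulk = bulk_average(pre_proportions, pre_expression)
print(f"Pre-treatment bulk average of Gene X: {pre_bulk:.2f}")  # 2.64
```

Because subpopulation C carries 80% of the weight, its value of 3.0 dominates the result.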
In response to the treatment, subpopulation A upregulates Gene X and now expresses it at an average log CPM of 0.3 (that’s a log2(0.3/0.1) = +1.58 log2 fold change). Similarly, subpopulation B upregulates Gene X to 1.8 (+0.26 log2 fold change) and subpopulation C upregulates Gene X to 3.5 (+0.22 log2 fold change). However, using bulk RNA sequencing, because of the underlying shift in subpopulation proportions, what we would observe is an average expression of Gene X at 0.80*0.3+0.16*1.8+0.04*3.5=0.67. Now, the gene is fairly lowly expressed on average in the tumor because it is lowly expressed in subpopulation A, which is the new dominant subpopulation in the tumor. As you can see, this observed post-treatment average expression of Gene X is a decrease from our pre-treatment average of 2.64, with a substantial log2 fold change of -1.98. Therefore, our bulk assay would lead us to believe that treatment with the drug results in downregulation of Gene X.
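To check these numbers yourself, here is a small Python sketch that recomputes the per-subpopulation and bulk log2 fold changes from the values in this example. The variable and function names are just illustrative choices.

```python
from math import log2

# Proportions and mean Gene X expression (log CPM) from the example.
pre_prop = {"A": 0.04, "B": 0.16, "C": 0.80}
post_prop = {"A": 0.80, "B": 0.16, "C": 0.04}
pre_expr = {"A": 0.1, "B": 1.5, "C": 3.0}
post_expr = {"A": 0.3, "B": 1.8, "C": 3.5}


def bulk_average(prop, expr):
    """Proportion-weighted mean, i.e. what a bulk assay would observe."""
    return sum(prop[k] * expr[k] for k in prop)


# Fold change within each subpopulation (what single-cell data could resolve).
per_group_lfc = {k: log2(post_expr[k] / pre_expr[k]) for k in pre_expr}

# Fold change of the mixed sample (what bulk RNA-seq would report).
pre_bulk = bulk_average(pre_prop, pre_expr)
post_bulk = bulk_average(post_prop, post_expr)
bulk_lfc = log2(post_bulk / pre_bulk)

for k, lfc in per_group_lfc.items():
    print(f"Subpopulation {k}: {lfc:+.2f}")
print(f"Bulk: {pre_bulk:.2f} -> {post_bulk:.2f} ({bulk_lfc:+.2f})")
```

Every per-subpopulation fold change comes out positive, while the bulk fold change comes out negative.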
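A summary of expression of Gene X (average log CPM):

|Subpopulation|Pre-treatment|Post-treatment|Log2 Fold Change|
|---|---|---|---|
|A|0.1|0.3|+1.58|
|B|1.5|1.8|+0.26|
|C|3.0|3.5|+0.22|
|Bulk (weighted average)|2.64|0.67|-1.98|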
So how can it be that each subpopulation INCREASED its expression of Gene X but on average we observe a DECREASE post-treatment? Herein lies Simpson’s Paradox. As elegantly stated by Cole Trapnell, “failing to properly compartmentalize the data by cell type leads to a qualitatively incorrect interpretation. The misleading effects of Simpson’s Paradox are likely to be pervasive in modern experiments using bulk assays.”
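One way to see where the reversal comes from is to change one thing at a time. The sketch below is a what-if calculation on the same hypothetical numbers, not an analysis method from any particular tool: it recomputes the bulk average with only the expression changing, and then with only the proportions changing.

```python
# Proportions and mean Gene X expression (log CPM) from the example.
pre_prop = {"A": 0.04, "B": 0.16, "C": 0.80}
post_prop = {"A": 0.80, "B": 0.16, "C": 0.04}
pre_expr = {"A": 0.1, "B": 1.5, "C": 3.0}
post_expr = {"A": 0.3, "B": 1.8, "C": 3.5}


def bulk_average(prop, expr):
    """Proportion-weighted mean, i.e. what a bulk assay would observe."""
    return sum(prop[k] * expr[k] for k in prop)


pre_bulk = bulk_average(pre_prop, pre_expr)

# What if expression changed but the tumor composition had stayed the same?
expression_only = bulk_average(pre_prop, post_expr)

# What if the composition shifted but no subpopulation changed its expression?
composition_only = bulk_average(post_prop, pre_expr)

print(f"Observed pre-treatment bulk:   {pre_bulk:.2f}")
print(f"Expression change only:        {expression_only:.2f}")
print(f"Composition change only:       {composition_only:.2f}")
```

With the composition held fixed, the bulk average rises (to 3.10), matching the per-subpopulation upregulation. With expression held fixed, the composition shift alone drags the bulk average down (to 0.44). The apparent downregulation is therefore driven by which cells are in the sample, not by what each cell is doing.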
Of course, this is just a hypothetical example. However, this happens in real biology as well. The best explicit example of Simpson’s Paradox I’ve seen in real biology (courtesy of Arjun Raj): linc-HOXA1 is a non-coding RNA that represses Hoxa1 in cis.