Simpsons Paradox
Jan 5, 2018
Illustrating the impact of Simpson’s Paradox and the need to Single Cell measurements
Simpson’s paradox is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is often cited as a reason for single cell measurements. Here, we illustrate Simpson’s paradox with an explicit hypothetical example relevant to single cell RNA-seq.
Consider a tumor sample pre- and post- drug treatment.
The tumor consists of 3 distinct subpopulations A, B, and C (perhaps they’re genetically distinct with different driver mutations). Prior to treatment, the dominant subpopulation was C, comprising 80% of the tumor, while subpopulations B and A comprised 16% and 4% of the tumor respectively. After treatment, perhaps in response to selective pressures imposed by the treatment, subpopulation C has shrunken while subpopulation A has grown to become the new dominant subpopulation. Now subpopulation C comprises only 4% of the tumor while subpopulation A comprises 80%. The proportion of subpopulation B is unchanged.
A summary of the information thus far:
Proportion | Pre-treatment | Post-treatment |
---|---|---|
Subpopulation A | 0.04 | 0.80 |
Subpopulation B | 0.16 | 0.16 |
Subpopulation C | 0.80 | 0.04 |
Total | 1.00 | 1.00 |
Consider that we are interested in the impact of the drug treatment on a particular gene, Gene X. Gene X may be a target within the pathway affected by the drug treatment.
Prior to treatment, subpopulation A expressed Gene X at an average log CPM of 0.1, while subpopulation B expressed this gene at a higher level of 1.5 and subpopulation C at 3.0. Using bulk RNA sequencing, which does not allow us to distinguish among these subpopulations, we would observe an average expression of Gene X at 0.04*0.1+0.16*1.5+0.80*3=2.64. As expected, the gene is fairly highly expressed on average in the tumor, since it is highly expressed in subpopulation C, which is the dominant subpopulation in the tumor.
In response to the treatment, subpopulation A upregulates Gene X and now expresses it at an average log CPM of 0.3 (that’s a LOG2(0.3/0.1)=+1.58 LOG2 fold change). Similarly, subpopulation B upregulates Gene X to 1.8 (+0.26 LOG2 fold change) and subpopulation C upregulates gene X to 3.5 (+0.22 LOG2 fold change). However, using bulk RNA sequencing, because of the underlying shift in subpopulation proportion, what we would observe is an average expression of Gene X at 0.8*0.3+0.16*1.8+0.04*3.5=0.67. Now, the gene is fairly lowly expressed on average in the tumor because it is lowly expressed in subpopulation A, which is the new dominant subpopulation in the tumor. As you can see, this observed post-treatment average expression of Gene X is a decrease from our pre-treatment average, and with a not insignificant fold change of -1.98. Therefore, our bulk assay would lead us to believe that treatment with the drug results in downregulation of Gene X.
A summary of Expression of Gene X:
Pre-treatment | Post-treatment | Log2 Fold Change | |
---|---|---|---|
Subpopulation A | 0.10 | 0.30 | +1.58 |
Subpopulation B | 1.50 | 1.80 | +0.26 |
Subpopulation C | 3.00 | 3.50 | +0.22 |
Population Average | 2.64 | 0.67 | -1.98 |
So how can it be that each subpopulation INCREASED its expression of Gene X but on average we observe a DECREASE post-treatment? Herein lies Simpson’s Paradox. As elegantly stated by Cole Trapnell, “failing to properly compartmentalize the data by cell type leads to a qualitatively incorrect interpretation. The misleading effects of Simpson’s Paradox are likely to be pervasive in modern experiments using bulk assays.”
Of course, this is just a dummy hypothetical example. However, this happens in real biology as well. The best explicit example of Simpson’s Paradox I’ve seen in real biology (courtesy of Arjun Raj): linc-HOXA1 is a non-coding RNA that represses Hoxa1 in cis
- Older
- Newer
RECENT POSTS
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve on 04 May 2023
- Aligning Spatial Transcriptomics Data With Stalign on 16 April 2023
- 3D animation of the brain in R on 08 November 2022
- Ethical Challenges in Biomedical Engineering - Data Collection, Analysis, and Interpretation on 15 October 2022
- I use R to (try to) figure out the cost of medical procedures by analyzing insurance data from the Transparency in Coverage Final Rule on 12 September 2022
- Annotating STdeconvolve Cell-Types with ASCT+B Tables on 30 August 2022
- Deconvolution vs Clustering Analysis: An exploration via simulation on 11 July 2022
- Coloring SVGs in R on 17 June 2022
- Deconvolution vs Clustering Analysis for Multi-cellular Pixel-Resolution Spatially Resolved Transcriptomics Data on 03 May 2022
- Exploring UMAP parameters in visualizing single-cell spatially resolved transcriptomics data on 19 January 2022