Multi-Panel Data Visualization of Breast Cancer Cell Cluster and Genes
Describe your figure briefly so we know what you are depicting (you no longer need to use precise data visualization terms as you have been doing).
In my data visualization, I first created a k-means cluster plot of the pikachu dataset with 10 centers based on a withiness plot. I then selected cluster 10 as my cluster of interest and plotted it in both physical space and in reduced dimensions, by PCA. I also did a wilcox test on cluster 10 and found the top 10 most differentially expressed genes in my cluster with a p-value less than 0.05 and created a bar graph visualizing what percent of the cells in my cluster expressed each of these genes. Finally, I selected one of these genes: CDH1, and plotted it in both physical space and reduced dimensions, again by PCA.
Write a description to convince me that your cluster interpretation is correct. Your description may reference papers and content that allowed you to interpret your cell cluster as a particular cell-type. You must provide attribution to external resources referenced. Links are fine; formatted references are not required.
The differential expression or upregulation of CDH1 within cluster 10 leads me to believe that this cluster of cells may be epithelial cells of breast cancer tissue. According to The Human Protein Atlas, CDH1 is predominantly expressed in epithelial cells in order to mintain tissue integrity, as it is involved in regulating cell-cell adhesions. The Atlas also points out that this gene is associated with breast cancer, suggesting it involvment in increased proliferation in cancer progression. My bar graph: Differentially Expressed Genes in Cluster 10, shows that over 80% of the cells in cluster 10 express this gene which strongly supports the inference that cluster 10 has breast cancer epithelial cells in it.
External Source: The Human Protein Altas (https://www.proteinatlas.org/ENSG00000039068-CDH1)
The entire code used to generate the figure so that it can be reproduced.
library(ggplot2)
library(Rtsne)
library(patchwork)
library(gganimate)
library(lattice)
library(ggrepel)
library(gridExtra)
data <- read.csv('/Users/shailitripathi/Downloads/GENOMIC/Homework/pikachu.csv.gz', row.names=1)
dim(data)
data[1:5, 1:10]
pos <- data[,4:5]
gexp <- data[,6:ncol(data)]
# NORMALIZE
gexpnorm <- log10(gexp/rowSums(gexp) * mean(rowSums(gexp)) + 1)
# PCA
pcs <- prcomp(gexpnorm)
plot(pcs$sdev[1:30], type = 'l') # plot variance of first 30 pcs
pc_plot <- ggplot(data.frame(pcs$x)) + geom_point(aes(x = PC1, y = PC2)) +
geom_hline(yintercept = 0, linetype = "solid", color = "black") +
geom_vline(xintercept = 0, linetype = "solid", color = "black") +
theme_minimal()
pc_plot
# K-MEANS
set.seed(123)
totw <- sapply(1:20, function(i) {
com <- kmeans(pcs$x[, 1:5], centers = i)
return(com$tot.withinss)
})
plot(totw, type = 'l')
com <- kmeans(pcs$x[, 1:5], centers = 10)
cluster_labels <- as.factor(com$cluster)
pca_cluster <- ggplot(data.frame(pcs$x[, 1:2], com = cluster_labels)) +
geom_point(aes(x = PC1, y = PC2, col = com), size = 0.5) +
labs(title = "Clusters in PCA Space") +
geom_hline(yintercept = 0, linetype = "solid", color = "black") +
geom_vline(xintercept = 0, linetype = "solid", color = "black") +
theme_minimal()
actual_cluster <- ggplot(data.frame(pos, com = cluster_labels)) +
geom_point(aes(x = aligned_x, y = aligned_y, col = com), size = 0.25) +
labs(title = "Clusters in Actual Space")
pca_cluster + actual_cluster
# CLUSTERS OF INTEREST
cluster_data <- data[cluster_labels==4, ]
cluster4 <- ggplot(cluster_data, aes(x = aligned_x, y = aligned_y)) +
geom_point(color = 'chartreuse3', size = 0.5) +
ggtitle("Cluster 1") +
theme_minimal()
cluster_colors <- ifelse(cluster_labels == 4, "Cluster 4", "Other Clusters")
cluster4_pca <- ggplot(data.frame(pcs$x[, 1:2], com = cluster_colors)) +
geom_point(aes(x = PC1, y = PC2, color = com), size = 0.5) +
labs(title = "Cluster 1 in PCA Space") +
geom_hline(yintercept = 0, linetype = "solid", color = "black") +
geom_vline(xintercept = 0, linetype = "solid", color = "black") +
scale_color_manual(values = c("Cluster 4" = "chartreuse3", "Other Clusters" = "gray")) +
theme_minimal()
# DIFFERENTIALLY EXPRESSED GENES - Wilcox test
results <- sapply(colnames(gexpnorm), function(g) {
wilcox.test(gexpnorm[cluster_labels == 4, g],
gexpnorm[cluster_labels != 4, g],
alternative = "greater")$p.val
})
results
names(sort(results[results < 0.05], decreasing = FALSE))[1:10]
top_10_genes <- names(sort(results[results < 0.05], decreasing = FALSE))[1:10]
percent_expression <- sapply(top_10_genes, function(gene) {
mean(cluster_data[gene] > 0) * 100
})
plot_data <- data.frame(
Gene = top_10_genes,
Percent_Expression = percent_expression
)
dif_exp <- ggplot(plot_data, aes(x = Gene, y = Percent_Expression)) +
geom_bar(stat = "identity", fill = "chartreuse4") +
labs(x = "Top 10 Genes", y = "% of Cells Expressing Gene", title = "Differentially Expressed Genes in Cluster 1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
dif_exp
# CDH1
CDH1_data <- data.frame(data, cluster_labels, gene=gexpnorm[, 'CDH1'])
CDH1 <- ggplot(CDH1_data, aes(x = aligned_x, y = aligned_y, color = CDH1)) +
geom_point(size = 0.5) +
ggtitle("CDH1") +
scale_color_gradient(low = "white", high = "chartreuse4" ) +
theme_minimal()
CDH1_colors <- ifelse(gexpnorm[, 'CDH1'] > 0, "CDH1", "No CDH1")
CDH1_pca <- ggplot(data.frame(pcs$x[, 1:2], com = CDH1_colors)) +
geom_point(aes(x = PC1, y = PC2, color = com), size = 0.5) +
labs(title = "CDH1 in PCA Space") +
geom_hline(yintercept = 0, linetype = "solid", color = "black") +
geom_vline(xintercept = 0, linetype = "solid", color = "black") +
scale_color_manual(values = c("CDH1" = "chartreuse4", "No CDH1" = "gray")) +
theme_minimal()
# DATA VISUALIZATION
lay <- rbind(c(1,2),
c(3,4),
c(5, 5),
c(6, 7))
grid.arrange(actual_cluster, pca_cluster,
cluster4, cluster4_pca,
dif_exp,
CDH1, CDH1_pca,
layout_matrix = lay,
top = "Gene Expression in Pickachu Data")