Multi-Panel Data Visualization of Breast Cancer Cell Cluster and Genes


Shaili Tripathi
I am an undergraduate junior majoring in biomedical engineering and minoring in computer science.

Multi-Panel Data Visualization of Breast Cancer Cell Cluster and Genes

Describe your figure briefly so we know what you are depicting (you no longer need to use precise data visualization terms as you have been doing).

In my data visualization, I first created a k-means cluster plot of the pikachu dataset with 10 centers based on a withiness plot. I then selected cluster 10 as my cluster of interest and plotted it in both physical space and in reduced dimensions, by PCA. I also did a wilcox test on cluster 10 and found the top 10 most differentially expressed genes in my cluster with a p-value less than 0.05 and created a bar graph visualizing what percent of the cells in my cluster expressed each of these genes. Finally, I selected one of these genes: CDH1, and plotted it in both physical space and reduced dimensions, again by PCA.

The differential expression or upregulation of CDH1 within cluster 10 leads me to believe that this cluster of cells may be epithelial cells of breast cancer tissue. According to The Human Protein Atlas, CDH1 is predominantly expressed in epithelial cells in order to mintain tissue integrity, as it is involved in regulating cell-cell adhesions. The Atlas also points out that this gene is associated with breast cancer, suggesting it involvment in increased proliferation in cancer progression. My bar graph: Differentially Expressed Genes in Cluster 10, shows that over 80% of the cells in cluster 10 express this gene which strongly supports the inference that cluster 10 has breast cancer epithelial cells in it.

External Source: The Human Protein Altas (https://www.proteinatlas.org/ENSG00000039068-CDH1)

The entire code used to generate the figure so that it can be reproduced.

library(ggplot2)
library(Rtsne)
library(patchwork)
library(gganimate)
library(lattice)
library(ggrepel)
library(gridExtra)

data <- read.csv('/Users/shailitripathi/Downloads/GENOMIC/Homework/pikachu.csv.gz', row.names=1)
dim(data)
data[1:5, 1:10]
pos <- data[,4:5]
gexp <- data[,6:ncol(data)]

# NORMALIZE

gexpnorm <- log10(gexp/rowSums(gexp) * mean(rowSums(gexp)) + 1)


# PCA 
pcs <- prcomp(gexpnorm)
plot(pcs$sdev[1:30], type = 'l') # plot variance of first 30 pcs
pc_plot <- ggplot(data.frame(pcs$x)) + geom_point(aes(x = PC1, y = PC2)) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  geom_vline(xintercept = 0, linetype = "solid", color = "black") + 
  theme_minimal() 
pc_plot


# K-MEANS

set.seed(123)

totw <- sapply(1:20, function(i) {
  com <- kmeans(pcs$x[, 1:5], centers = i)
  return(com$tot.withinss)
})
plot(totw, type = 'l')

com <- kmeans(pcs$x[, 1:5], centers = 10)
cluster_labels <- as.factor(com$cluster)

pca_cluster <- ggplot(data.frame(pcs$x[, 1:2], com = cluster_labels)) +
  geom_point(aes(x = PC1, y = PC2, col = com), size = 0.5) +
  labs(title = "Clusters in PCA Space") + 
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  geom_vline(xintercept = 0, linetype = "solid", color = "black") + 
  theme_minimal() 

actual_cluster <- ggplot(data.frame(pos, com = cluster_labels)) +
  geom_point(aes(x = aligned_x, y = aligned_y, col = com), size = 0.25) +
  labs(title = "Clusters in Actual Space") 

pca_cluster + actual_cluster

# CLUSTERS OF INTEREST 

cluster_data <- data[cluster_labels==4, ]
cluster4 <- ggplot(cluster_data, aes(x = aligned_x, y = aligned_y)) +
  geom_point(color = 'chartreuse3', size = 0.5) + 
  ggtitle("Cluster 1") + 
  theme_minimal()

cluster_colors <- ifelse(cluster_labels == 4, "Cluster 4", "Other Clusters")
cluster4_pca <- ggplot(data.frame(pcs$x[, 1:2], com = cluster_colors)) +
  geom_point(aes(x = PC1, y = PC2, color = com), size = 0.5) +
  labs(title = "Cluster 1 in PCA Space") + 
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  geom_vline(xintercept = 0, linetype = "solid", color = "black") + 
  scale_color_manual(values = c("Cluster 4" = "chartreuse3", "Other Clusters" = "gray")) +
  theme_minimal()

# DIFFERENTIALLY EXPRESSED GENES - Wilcox test

results <- sapply(colnames(gexpnorm), function(g) {
  wilcox.test(gexpnorm[cluster_labels == 4, g], 
              gexpnorm[cluster_labels != 4, g], 
              alternative = "greater")$p.val
})
results
names(sort(results[results < 0.05], decreasing = FALSE))[1:10]
top_10_genes <- names(sort(results[results < 0.05], decreasing = FALSE))[1:10]
percent_expression <- sapply(top_10_genes, function(gene) {
  mean(cluster_data[gene] > 0) * 100
})
plot_data <- data.frame(
  Gene = top_10_genes,
  Percent_Expression = percent_expression
)

dif_exp <- ggplot(plot_data, aes(x = Gene, y = Percent_Expression)) +
  geom_bar(stat = "identity", fill = "chartreuse4") +
  labs(x = "Top 10 Genes", y = "% of Cells Expressing Gene", title = "Differentially Expressed Genes in Cluster 1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
dif_exp


# CDH1

CDH1_data <- data.frame(data, cluster_labels, gene=gexpnorm[, 'CDH1'])
CDH1 <- ggplot(CDH1_data, aes(x = aligned_x, y = aligned_y, color = CDH1)) +
  geom_point(size = 0.5) +
  ggtitle("CDH1") + 
  scale_color_gradient(low = "white", high = "chartreuse4" ) +
  theme_minimal()

CDH1_colors <- ifelse(gexpnorm[, 'CDH1'] > 0, "CDH1", "No CDH1")
CDH1_pca <- ggplot(data.frame(pcs$x[, 1:2], com = CDH1_colors)) +
  geom_point(aes(x = PC1, y = PC2, color = com), size = 0.5) +
  labs(title = "CDH1 in PCA Space") + 
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  geom_vline(xintercept = 0, linetype = "solid", color = "black") + 
  scale_color_manual(values = c("CDH1" = "chartreuse4", "No CDH1" = "gray")) +
  theme_minimal()


# DATA VISUALIZATION

lay <- rbind(c(1,2),
             c(3,4), 
             c(5, 5), 
             c(6, 7))

grid.arrange(actual_cluster, pca_cluster,
             cluster4, cluster4_pca,
             dif_exp,
             CDH1, CDH1_pca,
             layout_matrix = lay,
             top = "Gene Expression in Pickachu Data")