Differential Gene Expression Analysis- Eevee vs. Pikachu

Hi everyone! I'm a master's student in BME. I work for a genomics company, so I'm excited to learn more about visualization of the data! In my spare time I love to play piano and hike with my dog.

Differential Gene Expression Analysis- Eevee vs. Pikachu

Description

There are a few reasons I am confident in saying that I picked the same cells. First, I looked at TACSTD2 expression in PC space, and it matched very well with where my cluster is highlighted. In my original cluster from HW3, I also had very high TACSTD2 expression. The matching of high TACSTD2 expression was true in physical space as well. I then looked at the top 3 differentially expressed genes in my cluster, which did differ from those in the Pikachu set. However, CEACAM6, which was one of the most commonly overexpressed genes in my cluster, is associated with Crohn’s disease, particularly as a marker of inflammation in gut epithelial cells [1], which is true for TACSTD2 as well. Additionally, the other two highly overexpressed genes, CPB1 and FXYD3, are both overexpressed in cancer cells [2], [3]. Since one of my hypotheses for my cluster from my Pikachu set was that it was gut epithelial cancer cells, I believe it is reasonable to assume that I have isolated the same cluster.

In terms of code changes, I had to edit how I was splicing the dataset because the columns between the Eevee and Pikachu datasets are different. After I changed the splicing, I re-did my K-means clustering. I chose 8 clusters this time rather than 12 because that is where the plateau appeared in my totw plot. I believe that there are fewer clusters in the Eevee dataset because the resolution is much lower, due to spot rather than per-cell analysis. Because each spot contains many cells, there is likely slightly less overall variation, thus resulting in fewer clusters.

I had some issues with my original PCA code stating that I had too many rows in one dataframe. However, upon closer inspection, it appeared that the code which was throwing errors was extraneous, so I simply deleted it. I then changed the names of the variables in my plots and in the code itself to be consistent (i.e. changing ‘Cluster 8’ to ‘Cluster 7’) for clarity in the plots, as well as for code legibility. Finally, I had originally matched my cluster to my cells based on cell index. This was not possible for the Eevee dataset since individual cells are not measured. Instead, I matched the barcodes of the spots to figure out which spots were within my cluster of interest.

https://www.ncbi.nlm.nih.gov/gene/4680
https://www.ncbi.nlm.nih.gov/gene/1360
https://www.genecards.org/cgi-bin/carddisp.pl?gene=FXYD3

Code

library(ggplot2)
library(patchwork)

file <- "data/eevee.csv.gz"
data <- read.csv(file)
data[1:5,1:10]

#A panel visualizing your one cluster of interest in reduced dimensional space (PCA, tSNE, etc)
#A panel visualizing your one cluster of interest in physical space
#A panel visualizing differentially expressed genes for your cluster of interest
#A panel visualizing one of these genes in reduced dimensional space (PCA, tSNE, etc)
#A panel visualizing one of these genes in space

pos <- data[, 3:4]
rownames(pos) <- data$cell_id
gexp <- data[, 5:ncol(data)]
rownames(gexp) <- data$barcode

## if you normalize, if you log transform
## you can use different transformations of the gene expression
## for different parts of the analysis
loggexp <- log10(gexp+1)

#Set seed for reproducability with chosen cluster
set.seed(42)
com <- kmeans(loggexp, centers=8)
clusters <- com$cluster
clusters <- as.factor(clusters) ## tell R it's a categorical variable
names(clusters) <- rownames(gexp)
head(clusters)

pcs <- prcomp(loggexp)
df <- data.frame(pcs$x, clusters)
data$PC1 <- pcs$x[,1]
data$PC2 <- pcs$x[,2]

#using this to pick my cluster of interest
ggplot(data) + 
  geom_point(aes(x = aligned_x, y=aligned_y, col=clusters)) +
  ggtitle("clusters in real space") +
  theme(plot.title = element_text(hjust=0.5)) +
  xlab("x alignment") + ylab("y alignment")

#Cluster 7 seems to be the most aligned to my original cluster
# in physical space
interest <- 7
cellsOfInterest <- names(clusters)[clusters == interest]
otherCells <- names(clusters)[clusters != interest]

# Create a new column for coloring
data$highlight <- ifelse(clusters == interest, "Cluster 7", "Other Cells")

p1 <- ggplot(data) + 
  geom_point(aes(x = PC1, y = PC2, col = highlight)) +
  ggtitle("Cluster 7 vs other cells in reduced dimension (PC) space") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("PC1") + 
  ylab("PC2") +
  scale_color_manual(values = c("Cluster 7" = "blue", "Other Cells" = "gray"))

p2 <- ggplot(data) + 
  geom_point(aes(x = aligned_x, y = aligned_y, col = highlight)) +
  ggtitle("Cluster 7 vs Other Cells in Physical Space") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("x alignment") + 
  ylab("y alignment") +
  scale_color_manual(values = c("Cluster 7" = "blue", "Other Cells" = "gray"))

#Differential gene expression
gexp_cluster7 <- loggexp[cellsOfInterest, ]  # Expression in Cluster 7
gexp_other <- loggexp[otherCells, ]          # Expression in other clusters

pvals <- sapply(1:ncol(loggexp), function(i) {
  wilcox.test(gexp_cluster7[, i], gexp_other[, i], exact = FALSE)$p.value
})

#correction as mentioned in class (i picked bonferroni)
adj_pvals <- p.adjust(pvals, method = "bonferroni")

#mean expression diff
logFC <- colMeans(gexp_cluster7) - colMeans(gexp_other)

# Create a dataframe of results
DE_genes <- data.frame(Gene = colnames(loggexp), 
                       logFC = logFC, pval = pvals, adj_pval = adj_pvals)

# Filter p-value < 0.05)
DE_genes <- DE_genes[DE_genes$adj_pval < 0.05, ]

# Sort by mean expression diff 
DE_genes <- DE_genes[order(DE_genes$logFC, decreasing = TRUE), ]

# Select the top 5 most differentially expressed genes
top5_genes <- DE_genes$Gene[1:5]  
print(top5_genes) 

# Subset expression data for the top 5 DE genes within Cluster 8
top5_expr <- loggexp[cellsOfInterest, top5_genes]

# Determine the most highly expressed gene per cell
most_expressed_gene <- apply(top5_expr, 1, function(x) top5_genes[which.max(x)])

# Ensure the spatial coordinates match the selected cells
cell_indices <- match(cellsOfInterest, data$barcode) 

# Create a dataframe for plotting
plot_data <- data.frame(
  cell_id = cellsOfInterest,
  aligned_x = data$aligned_x[cell_indices],
  aligned_y = data$aligned_y[cell_indices],
  top_gene = most_expressed_gene
)

p3 <- ggplot(plot_data) +
  geom_point(aes(x = aligned_x, y = aligned_y, color = top_gene)) +
  ggtitle("Most Expressed Top 3 DE Genes in Cluster 7") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("x alignment") +
  ylab("y alignment") +
  labs(color = "Top DE Gene")

#it looks like TACSTD2 is the most preferentially expressed gene, 
# so I will further analyze that

# Extract TACSTD2 expression across all cells
tac_expression <- loggexp[, "TACSTD2"]

# Add TACSTD2 expression to the original data
data$tac_expression <- tac_expression

p4 <- ggplot(data, aes(x = PC1, y = PC2, color = tac_expression)) +
  geom_point() +
  scale_color_gradient(low = "blue", high = "red") +
  ggtitle("TACSTD2 Expression in PCA Space") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("PC1") +
  ylab("PC2") +
  labs(color = "TACSTD2 Expression")

p5 <- ggplot(data, aes(x = aligned_x, y = aligned_y, color = tac_expression)) +
  geom_point() +
  scale_color_gradient(low = "blue", high = "red") +
  ggtitle("TACSTD2 Expression in Physical Space") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("x alignment") +
  ylab("y alignment") +
  labs(color = "TACSTD2 Expression")


combined_plot <- (p1 | p2) / p3 / (p4 | p5)    

# Display the combined plot
combined_plot

20 Feb 2025

« hw 4 DEG analysis EC1: Comparing PCA and tSNE clustering methods with gganimate »