Deconvolution of Eevee Dataset for Identification of the Epithelial Cell Type

Hi everyone! My name is Meera Bhat, and I am a BME senior and master's student. I am excited to join this class and learn more about genomic data visualization as I have been working in Dr. Rong Fan's lab at Yale University on spatial omics sequencing and transcriptomics with senescent tissues and would love to learn more. In my free time, I love to read, paint my nails, cook, and dance!

Deconvolution of Eevee Dataset for Identification of the Epithelial Cell Type

In this study, I analyzed the Eevee spatial transcriptomics dataset using deconvolution techniques and clustering methods to identify distinct cell types and visualize their gene expression patterns. The dataset was processed using STdeconvolve for deconvolution and k-means clustering to analyze spatial organization, followed by differential expression analysis to identify upregulated genes. I focused on cell type 9, which exhibited high expression of epithelial cell markers, including KRT17, KRT6B, KLK5, LAMB3, COL17A1, and LAMC2. This selection was informed by my previous work in HW4, where I identified an epithelial cell population as a key component within the dataset.

When comparing my results to previous analyses, I observed notable similarities and improvements. In HW4, k-means clustering was applied to identify spatially distinct cell clusters, revealing a heterogeneous distribution of cell types. In this analysis, I optimized k-means clustering with k = 9, which aligned with prior findings but included t-SNE visualization for refined cluster identification. The epithelial-associated cluster (cluster 9) showed a similar spatial distribution to HW4’s results, reinforcing its biological significance. However, the use of STdeconvolve for deconvolution provided a more nuanced perspective by estimating the proportion of multiple cell types within each spatial location. The scatterbar plot visualization highlighted the localized enrichment of cell type 9, confirming its spatial pattern in the tissue, a finding that complements previous clustering-based observations.

My differential gene expression analysis further supported these insights. The volcano plot revealed key differentially expressed genes within cell type 9, with SCNN1A, RHOV, CGN, COL17A1, FAM83B, LAMB3, and ANXA3 identified as significantly upregulated. These genes reinforce the epithelial identity of the cluster as confirmed by the Human Protein Atlas, suggesting a role in epithelial differentiation and function. Conversely, downregulated genes, including CXCL12, FBLN1, COL14A1, SELENOP, PLTP, RNASE1, and C1QA, indicate potential microenvironmental influences on the epithelial cell state. Compared to my HW4 analysis, deconvolution-based separation of mixed populations allowed for higher specificity in distinguishing transcriptional activity within the identified cell type.

The data visualizations effectively illustrate my findings. The scatterbar plot of deconvolved cell-type proportions showcases the spatial distribution of cell type 9, revealing its presence in specific tissue regions. The deconvolved gene expression plot further highlights the spatial pattern of epithelial marker expression, confirming the structured localization of these cells. However, across all cell types, it was observed that gene expression of epithelial cells were only strongly expressed in cell types 8 and 9. The k-means clustering visualization, using t-SNE, demonstrates how the epithelial-associated cluster aligns with the deconvolved map, reinforcing its consistency across different analytical approaches. Finally, the volcano plot captures differentially expressed genes, emphasizing key upregulated and downregulated genes associated with cell type 9, helping to pinpoint molecular markers defining this population. Therefore, this approach improves the resolution of spatial transcriptomic analyses and enhances our understanding of tissue heterogeneity, offering a more detailed perspective on cellular organization and gene expression in complex tissues.

Sources: https://www.proteinatlas.org/ENSG00000143375-CGN https://www.proteinatlas.org/ENSG00000138772-ANXA3 https://www.proteinatlas.org/ENSG00000196878-LAMB3 https://www.proteinatlas.org/ENSG00000168143-FAM83B https://www.proteinatlas.org/ENSG00000065618-COL17A1 https://www.proteinatlas.org/

The code used to generate this visualization is as follows:

# Necessary libraries loaded
library(ggplot2)
library(patchwork)
library(Rtsne)
library(ggrepel)
library(factoextra)
library(NMF)
library(cluster)
library(pheatmap)
library(grid)
library(reshape2)
library(dplyr)
library(tidyverse)


# Loads Eevee dataset
file <- '~/Desktop/genomic-data-visualization-2025/data/eevee.csv.gz'
data <- read.csv(file, row.names = 1)

# Extracts spatial coordinates and gene expression data from aligned_x and aligned_y columns
pos <- data[, 2:3]
#rownames(pos) <- data$cell_id
gexpsub <- data[, 4:ncol(data)]
#rownames(gexpsub) <- data$cell_id

# Log-transforms gene expression data and normalizes gexp
normgexpsub <- log10(gexpsub/rowSums(gexpsub) * mean(rowSums(gexpsub)) + 1)

# Chooses top most variable genes to ensure KRT17 is included
variable_genes <- apply(normgexpsub, 2, var)
top_genes <- names(sort(variable_genes, decreasing = TRUE)[1:1000])

# Defines key epithelial cell markers that must be included
epi_cell_markers <- c("KRT6B", "KRT17", "KLK5", "LAMB3", "COL17A1", "LAMC2")
print(paste("These are the key epithelial cell markers to include:", paste(epi_cell_markers, collapse = ", ")))

# Checks missing markers from the top 1000
miss_markers <- setdiff(epi_cell_markers, top_genes)
print(paste("These are the key epithelial cell markers missing:", paste(miss_markers, collapse = ", ")))

# If all markers are in the top 1000, use them directly
if (length(miss_markers) == 0) {
  import_genes <- top_genes  # Use top 1000 genes directly
  print("All epithelial cell markers are in the top 1000 genes.")
} else {
  # Select the top 1000 minus the number of missing markers (to maintain size = 1000)
  top_select <- names(sort(variable_genes, decreasing = TRUE)[1:(1000 - length(miss_markers))])
  
  # Add the missing markers
  import_genes <- unique(c(top_select, miss_markers))
  print(paste("The following epithelial cell markers were added:", paste(miss_markers, collapse=", ")))
}

norm_gexp_sub_filter <- normgexpsub[, import_genes]

# Performs PCA on gene expression data
pcs <- prcomp(norm_gexp_sub_filter)
plot(pcs$sdev[1:40])
plot(pcs$sdev[1:20])
plot(pcs$sdev[1:10])


# Finds optimal PCs for kMeans
plot(1:20, pcs$sdev[1:20], type = "l")
plot(1:10, pcs$sdev[1:10], type = "l")

# 6 PCs accounts for meaningful variance of the eevee dataset
pc_opt <- 6

# Perform t-SNE on PCA-reduced gene expression data
set.seed(5)  # Ensures reproducibility

# Determine optimal number of k's (centroids) for kMeans using elbow method
elbow <- fviz_nbclust(norm_gexp_sub_filter, kmeans, method = "wss") + 
  geom_vline(xintercept = 5, linetype = 2) + 
  labs(subtitle = "Elbow Method")

elbow 

# 9 cluster k's accounts for meaningful variance of the eevee dataset
cluster <- 9

# Perform k-means clustering with k = 9
tsne_emb <- Rtsne(pcs$x[,1:pc_opt])$Y
com <- kmeans(norm_gexp_sub_filter, centers = 9)  # Example k-means clustering

# Creates a data frame for t-SNE visualization
df_tsne <- data.frame(tsne_emb)
colnames(df_tsne) <- c("tSNE1", "tSNE2")
#df_tsne$clusters <- as.factor(com$cluster)

# Assigns clusters to data frames
df_clusters <- data.frame(
  aligned_x = pos$aligned_x, 
  aligned_y = pos$aligned_y,
  tSNE1 = tsne_emb[, 1],  
  tSNE2 = tsne_emb[, 2], 
  kmeans = as.factor(com$cluster)  # Ensure `com` is a kmeans object
)

## Cluster Identification

# Computes mean expression of KRT17 per cluster
mean_cluster <- tapply(normgexpsub[, "KRT17"], com$cluster, mean)

# Identify the cluster with the highest KRT17 expression using quantitative variable cluster_of_interest
cluster_of_interest <- names(which.max(mean_cluster))
print(paste("Cluster with highest KRT17 expression:", cluster_of_interest)) # Cluster of Interest: 5

# Compute mean expression for other key epithelial cell markers across clusters
epi_cell_markers <- c("KRT6B", "KLK5", "LAMB3", "COL17A1", "LAMC2")

# Create a data frame to store mean expression of each marker per cluster
mean_cluster <- data.frame(cluster = levels(com)) 
#for (gene in epi_cell_markers) {
  #mean_cluster[[gene]] <- tapply(normgexpsub[, "KRT17"], com, mean, na.rm = TRUE)
#}
#print(mean_cluster)

mean_cluster$cluster <- as.character(mean_cluster$cluster)

# Identify cluster with highest combined expression for epithelial cell markers
highest_epi_cell_markers_cluster <- mean_cluster$cluster[which.max(rowMeans(mean_cluster[, epi_cell_markers], na.rm = TRUE))]
print(paste("Cluster with highest overall epithelial cell marker expression:", highest_epi_cell_markers_cluster))
print(paste("Cluster of Interest:", cluster_of_interest))

# Add cluster highlight for visualization
df_clusters$highlight <- ifelse(df_clusters$kmeans == cluster_of_interest, "Cluster of Interest", "Other")

# Store 'com' as a factor
df_clusters$kmeans <- as.factor(df_clusters$kmeans)

# Create a new column to highlight only the selected cluster
df_clusters$highlight <- ifelse(df_clusters$kmeans == cluster_of_interest, "Cluster of Interest", "Other")

# Check if 'highlight' column was created correctly
table(df_clusters$highlight)

# Extract physical coordinates x, y positions
df_clusters$aligned_x <- data[,2]  
df_clusters$aligned_y <- data[,3]  

# Spatial plot of Cluster of Interest
p1 <- ggplot(df_clusters) + 
  geom_point(aes(x = aligned_x, y = aligned_y, 
                 col = ifelse(kmeans == cluster_interest, "Cluster of Interest", "All Other Cells")),
             size = 1) +
  theme_minimal() +
  labs(title = "Physical Spatial Distribution of Cluster of Interest",
       x = "Physical X-position", y = "Physical Y-Position", color = "Clustering on tSNE") +
  theme(
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 6)
  ) + guides(color = guide_legend(override.aes = list(size = 2)))  + 
  scale_color_manual(values = c("lightgreen", "purple"))


p1

# Visualizes differentially expressed genes for cluster of interest, Cluster 1
com <- kmeans(normgexpsub, centers = 9)  # Example k-means clustering
com_categories <- as.factor(com$cluster)  # Extract and convert to factor

cluster_interest = 5

p_value = sapply(colnames(normgexpsub), function(i){
  print(i)
  wilcox.test(normgexpsub[com_categories == cluster_interest, i], normgexpsub[com_categories!= cluster_interest, i])$p.val
})

logfc = sapply(colnames(normgexpsub), function(i){
  print(i)
  log2(mean(normgexpsub[com_categories == cluster_interest, i])/mean(normgexpsub[com_categories != cluster_interest, i]))
})


# Creates a volcano plot dataframe
df_volcano <- data.frame(
  p_value = -log10(p_value + 1e-100), 
  logfc, 
  genes = colnames(normgexpsub)
)

# Removes NA values from df_volcano to reduce warning message error
df_volcano <- df_volcano[complete.cases(df_volcano), ]

# Adjusts threshold for gene labeling to avoid excessive overlap of labels
df_volcano$gene_label <- ifelse(
  df_volcano$p_value > 10 & abs(df_volcano$logfc) > 1.5,
  as.character(df_volcano$genes), 
  NA
)

#Assigns number of labels as 7 to allow for proper labeling without excessive overlap
num_labels_down <- 7  
num_labels_up <- 7

top_down <- df_volcano %>% filter(logfc < -1) %>% top_n(num_labels_down, wt = p_value)
top_up <- df_volcano %>% filter(logfc > 1) %>% top_n(num_labels_up, wt = p_value)

df_volcano$gene_label <- ifelse(
  df_volcano$genes %in% c(top_down$genes, top_up$genes), 
  as.character(df_volcano$genes), 
  NA
)

# Generates volcano plot
volcano_plot <- ggplot(df_volcano) + 
  geom_point(aes(x = logfc, y = p_value, col = ifelse(logfc > 0, 'Upregulated', 'Downregulated'))) +
  geom_label_repel(aes(x = logfc, y = p_value, label = gene_label), box.padding = 0.5,  
                   point.padding = 0.5, segment.color = 'grey50', max.overlaps = 30, min.segment.length = 0.5,
                   force = 2, size = 3) + 
  ylim(0, max(df_volcano$p_value) + 5) + geom_hline(yintercept = -log10(0.05), linetype = "dashed") + 
  geom_vline(xintercept = c(-1.5, 1.5), linetype = "dashed") +
  labs(col = "Gene Expression\nIn Cluster of Interest",
       title = "Volcano Plot Representation of Gene Expression in Cluster of Interest",
       x = "Log(Fold Change)", y = "-log10(P-value)") +
  theme(plot.title = element_text(face = "bold")) +
  scale_color_manual(values = c("purple", "lightgreen")) +
  theme(plot.title = element_text(face = "bold"),
        aspect.ratio = 0.25)  # Decrease the aspect ratio to make the plot wider

# Displays the volcano plot to identify clusters of downregulated and upregulated genes
volcano_plot


# Performs deconvolution using STdeconvolve (NMF)  
st_deconvolve <- function(data, cluster) {
  nmf_res <- nmf(data, rank = cluster, method = "lee", seed = 42, nrun = 1, .options = "NMFStrategy")
  cell_type_proportions <- basis(nmf_res)
  gene_signatures <- coef(nmf_res)
  
  return(list(cell_type_proportions = as.data.frame(cell_type_proportions),
              gene_signatures = as.data.frame(gene_signatures)))
}
# Runs STdeconvolve for k = 9
k <- 9
results <- st_deconvolve(norm_gexp_sub_filter, k)

# Extract expression values for epithelial cell markers
gene_signatures <- as.data.frame(results$gene_signatures)
epi_cell_signatures <- gene_signatures[, epi_cell_markers, drop = FALSE]
print(epi_cell_signatures)

# Focuses on Cell Type 9
celltype <- 9

# Scatterbar visualization: Spatial Scatter plot shows deconvolved cell-type proportions across tissue spots.
#
# Display deconvolved cell-type proportions for K=9
print("Deconvolved Cell-Type Proportions (K=9)")
print(head(results$cell_type_proportions))
df_plot <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, celltype]
)

p3 <- ggplot(df_plot, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", celltype),
       x="X-coordinate", 
       y="Y-coordinate",
       col = "Cell Type") +
  theme_minimal()

p3

# Heatmap of gene signatures: Displays the top marker genes defining each inferred cell type.
# Add some broader known epithelial cell genes to observe
epi_cell_genes <- c("KRT6B", "KRT17", "KLK5", "LAMB3", "COL17A1", "LAMC2")
print(epi_cell_genes %in% colnames(gene_signatures))

# Subsets the NMF results to show only these genes
epi_cell_heatmap_data <- gene_signatures[, intersect(colnames(gene_signatures), epi_cell_genes), drop = FALSE]

# Converts row names (cell types) into a proper column
epi_cell_heatmap_data <- as.data.frame(epi_cell_heatmap_data)
epi_cell_heatmap_data$CellType <- rownames(epi_cell_heatmap_data)

# Melts the data, treating "CellType" as an identifier
epi_cell_heatmap_melted <- melt(epi_cell_heatmap_data, id.vars = "CellType", 
                              variable.name = "Gene", value.name = "Expression")

# Generate heatmap for epithelial-cell markers
p4 <- ggplot(epi_cell_heatmap_melted, aes(x = Gene, y = CellType, fill = Expression)) +
  geom_tile() +
  scale_fill_viridis_c(name = "Expression Level") +  # Custom legend title
  theme_minimal() +
  labs(title = "Epithelial Cell Markers Across Deconvolved Cell Types",
       x = "Genes", y = "Cell Type") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

p4

## Additional Visualizations ##
# Visualizing Scatterbars for the other 8 deconvolved cell types for comparison
# Cell Type 1
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 1]
)
p5 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 1),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))

p5

# Cell Type 2
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 2]
)
p6 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 2),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))

p6

# Cell Type 3
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 3]
)
p7 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 3),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))
p7

# Cell Type 4
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 4]
)
p8 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 4),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))
p8

# Cell Type 5
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 5]
)
p9 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 5),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))
p9

# Cell Type 6
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 6]
)
p10 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 6),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))
p10

# Cell Type 7
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 7]
)
p11 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 7),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))
p11

# Cell Type 8
df_plotx <- data.frame(
  aligned_x = data[, 2], 
  aligned_y = data[, 3],
  Deconvolved_CellType = results$cell_type_proportions[, 8]
)
p12 <- ggplot(df_plotx, aes(x=aligned_x, y=aligned_y, color=Deconvolved_CellType)) +
  geom_tile(size=9) +
  scale_color_viridis_c() +
  labs(title=paste("Scatterbar of Cell Type", 8),
       x="X-coordinate", 
       y="Y-coordinate") +
  theme_minimal() +
  theme(legend.text = element_text(size = 8),  
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size = 6))
p12

#plots
library(cowplot)

# First row: Scatterbar, heatmap, and spatial distribution
row1 <- plot_grid(p3, p4, p1, ncol = 3, rel_widths = c(3, 3, 3))

# Second row: **WIDER Volcano plot**
row2 <- plot_grid(volcano_plot, ncol = 1)  # Max width

# Third row: Scatterbars for other cell types
row3 <- plot_grid(p5, p11, p12, ncol = 3, rel_widths = c(3, 3, 3))

# Combine everything, **giving row2 (volcano plot) the most space**
final_plot <- plot_grid(row1, row2, row3, ncol = 1, rel_heights = c(4, 6, 4))  # Volcano is largest

# Add annotations
#final_plot <- final_plot + plot_annotation(tag_levels = "A")

# Display the updated figure
print(final_plot)


# Sources:
# https://www.datacamp.com/doc/r/cluster 
# https://rpkgs.datanovia.com/factoextra/
# https://ggrepel.slowkow.com/
# code-lesson-5.R
# code-lesson-6.R
# code-lesson-7.R
# code-lesson-8.R
# code-lesson-9.R
# code-lesson-10.R
# code-lesson-11.R
# code-lesson-12.R
# code-lesson-13.R
# https://www.statology.org/set-seed-in-r/
# https://www.appsilon.com/post/r-tsne
# https://www.datacamp.com/tutorial/pca-analysis-r
# https://www.datacamp.com/tutorial/k-means-clustering-r
# https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame
# https://stackoverflow.com/questions/21271449/how-to-apply-the-wilcox-test-to-a-whole-dataframe-in-r
# https://biostatsquid.com/volcano-plots-r-tutorial/
# https://www.geeksforgeeks.org/how-to-create-and-visualise-volcano-plot-in-r/
# https://sjmgarnier.github.io/viridis/reference/scale_viridis.html
# https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/
# https://www.rdocumentation.org/packages/patchwork/versions/1.3.0/topics/plot_layout
# https://jef.works/scatterbar/index.html
# https://jef.works/scatterbar/articles/using-scatterbar-with-spatial-experiment.html
# https://jef.works/scatterbar/articles/getting-started-with-scatterbars.html
# https://www.proteinatlas.org/search/
# https://cran.r-project.org/web/packages/NMF/vignettes/heatmaps.pdf
# https://nmf.r-forge.r-project.org/heatmaps.html

07 Mar 2025

HW EC2

« Visualization of t-SNE on Different PC Counts on Pikachu Dataset Using gganimate to understand different preprocessing methods »