HW1
1. What about the data would you like to make salient?
- The visualization is designed to make spatial differences in gene expression patterns salient.
- By placing Aqp1 and Nphs2 in side-by-side spatial maps, the figure emphasizes where each gene is expressed within the tissue.
- This layout facilitates direct visual comparison of spatial localization rather than numerical magnitude.
- The goal is to highlight the anatomical compartmentalization of kidney cell types through spatial expression patterns
2. What data types are represented?
- Spatial data: x and y coordinates encode physical locations of cells in the tissue.
- Quantitative data: log-transformed gene expression values (log1p(expr)).
- Categorical data: gene identity (Aqp1 vs Nphs2) used for faceting.
- Relational data: the relationship between gene expression and spatial location
3. What data encodings (geometric primitives and visual channels) are used?
Geometric primitives
- Points: each point represents an individual cell. Visual channels
- Position (x, y): encodes spatial location (quantitative).
- Hue: encodes gene expression magnitude (quantitative).
- Faceting (enclosure): separates genes into distinct panels for comparison.
4. What Gestalt principles / perceptual principles are used?
- Similarity: cells with similar colors are perceived as having similar expression levels.
- Proximity: spatially nearby cells are perceived as related anatomical structures.
- Enclosure: faceting encloses each gene in its own panel, reinforcing separation by gene identity.
- Continuity: smooth color gradients support perception of spatial expression trends rather than noise
Together, these principles enhance salience and processing, making spatial expression patterns easier to interpret
5. Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)
library(viridis)
data <- read.csv(
"/Users/pmusenge/Desktop/genomic_data_vis/genomic-data-visualization-2026/data/Xenium-IRI-ShamR_matrix.csv.gz"
)
## sanity checks
dim(data)
head(data[, 1:5])
## Extracting spatial coordinates
pos <- data[, c("x", "y")]
rownames(pos) <- data[, 1]
## Extracting gene expression matrix
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[, 1]
dim(gexp)
head(gexp[, 1:5])
## Pair of genes to analyze
geneA <- "Nphs2" # podocyte
geneB <- "Aqp1" # proximal tubule (Megalin)
## Checking genes exist in data
if (!(geneA %in% colnames(gexp))) stop(paste("Missing gene:", geneA))
if (!(geneB %in% colnames(gexp))) stop(paste("Missing gene:", geneB))
## Building plotting dataframe
df <- data.frame(
x = pos$x,
y = pos$y,
A = gexp[, geneA],
B = gexp[, geneB]
)
## Help from ChatGPT
## prompt: I have an R data frame called df that contains spatial coordinates (x, y) and gene expression counts for two genes stored in columns A and B. I want to Log-transform both gene expression columns using log1p and Compute a simple dominance score defined as log1p(A) − log1p(B)
## 8) Transforming + defining a simple “dominance” score
## log1p = log(1 + count), robust for sparse counts
df <- df %>%
mutate(
A_log = log1p(A),
B_log = log1p(B),
score = A_log - B_log, # >0 => A dominates, <0 => B dominates
sumAB = A_log + B_log
)
## thresholds to play with/tune
thresh_sum <- 0.25 # how much total expression to count as “present”
thresh_dom <- 0.25 # how strong the dominance needs to be
df <- df %>%
mutate(
category = case_when(
sumAB < thresh_sum ~ "Neither/Low",
score >= thresh_dom ~ paste0(geneA, " dominant"),
score <= -thresh_dom ~ paste0(geneB, " dominant"),
TRUE ~ "Both / Mixed"
),
category = factor(
category,
levels = c(paste0(geneA, " dominant"), "Both / Mixed", paste0(geneB, " dominant"), "Neither/Low")
)
)
##Visualization: Side-by-side spatial maps for each gene
df_long <- df %>%
select(x, y, A_log, B_log) %>%
rename(!!geneA := A_log, !!geneB := B_log) %>%
pivot_longer(cols = c(all_of(geneA), all_of(geneB)),
names_to = "gene", values_to = "expr_log")
p4 <- ggplot(df_long, aes(x = x, y = y, color = expr_log)) +
geom_point(alpha = 0.8, size = 0.6) +
scale_color_viridis_c(name = "log1p(expr)") +
coord_fixed() +
facet_wrap(~ gene, ncol = 2) +
labs(
title = "Spatial expression (side-by-side)",
subtitle = "Two separate maps often communicate best"
) +
theme_minimal()
print(p4)