HW1

BME PhD. Soccer fan, flying Cessnas, and all things data science.

1. What about the data would you like to make salient?

The visualization is designed to make spatial differences in gene expression patterns salient.
By placing Aqp1 and Nphs2 in side-by-side spatial maps, the figure emphasizes where each gene is expressed within the tissue.
This layout facilitates direct visual comparison of spatial localization rather than numerical magnitude.
The goal is to highlight the anatomical compartmentalization of kidney cell types through spatial expression patterns

2. What data types are represented?

Spatial data: x and y coordinates encode physical locations of cells in the tissue.
Quantitative data: log-transformed gene expression values (log1p(expr)).
Categorical data: gene identity (Aqp1 vs Nphs2) used for faceting.
Relational data: the relationship between gene expression and spatial location

3. What data encodings (geometric primitives and visual channels) are used?

Geometric primitives

Points: each point represents an individual cell. Visual channels
Position (x, y): encodes spatial location (quantitative).
Hue: encodes gene expression magnitude (quantitative).
Faceting (enclosure): separates genes into distinct panels for comparison.

4. What Gestalt principles / perceptual principles are used?

Similarity: cells with similar colors are perceived as having similar expression levels.
Proximity: spatially nearby cells are perceived as related anatomical structures.
Enclosure: faceting encloses each gene in its own panel, reinforcing separation by gene identity.
Continuity: smooth color gradients support perception of spatial expression trends rather than noise

Together, these principles enhance salience and processing, making spatial expression patterns easier to interpret

5. Code

library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)
library(viridis)  


data <- read.csv(
  "/Users/pmusenge/Desktop/genomic_data_vis/genomic-data-visualization-2026/data/Xenium-IRI-ShamR_matrix.csv.gz"
)

## sanity checks
dim(data)
head(data[, 1:5])

## Extracting spatial coordinates 
pos <- data[, c("x", "y")]
rownames(pos) <- data[, 1]

## Extracting gene expression matrix
gexp <- data[, 4:ncol(data)]
rownames(gexp) <- data[, 1]

dim(gexp)
head(gexp[, 1:5])

## Pair of genes to analyze
geneA <- "Nphs2"  # podocyte
geneB <- "Aqp1"   # proximal tubule (Megalin)

## Checking genes exist in data
if (!(geneA %in% colnames(gexp))) stop(paste("Missing gene:", geneA))
if (!(geneB %in% colnames(gexp))) stop(paste("Missing gene:", geneB))

## Building plotting dataframe
df <- data.frame(
  x = pos$x,
  y = pos$y,
  A = gexp[, geneA],
  B = gexp[, geneB]
)

## Help from ChatGPT
## prompt: I have an R data frame called df that contains spatial coordinates (x, y) and gene expression counts for two genes stored in columns A and B. I want to Log-transform both gene expression columns using log1p and Compute a simple dominance score defined as log1p(A) − log1p(B)
## 8) Transforming + defining a simple “dominance” score
## log1p = log(1 + count), robust for sparse counts
df <- df %>%
  mutate(
    A_log = log1p(A),
    B_log = log1p(B),
    score = A_log - B_log,                 # >0 => A dominates, <0 => B dominates
    sumAB = A_log + B_log
  )


## thresholds to play with/tune
thresh_sum <- 0.25   # how much total expression to count as “present”
thresh_dom <- 0.25   # how strong the dominance needs to be

df <- df %>%
  mutate(
    category = case_when(
      sumAB < thresh_sum ~ "Neither/Low",
      score >=  thresh_dom ~ paste0(geneA, " dominant"),
      score <= -thresh_dom ~ paste0(geneB, " dominant"),
      TRUE ~ "Both / Mixed"
    ),
    category = factor(
      category,
      levels = c(paste0(geneA, " dominant"), "Both / Mixed", paste0(geneB, " dominant"), "Neither/Low")
    )
  )


##Visualization: Side-by-side spatial maps for each gene

df_long <- df %>%
  select(x, y, A_log, B_log) %>%
  rename(!!geneA := A_log, !!geneB := B_log) %>%
  pivot_longer(cols = c(all_of(geneA), all_of(geneB)),
               names_to = "gene", values_to = "expr_log")

p4 <- ggplot(df_long, aes(x = x, y = y, color = expr_log)) +
  geom_point(alpha = 0.8, size = 0.6) +
  scale_color_viridis_c(name = "log1p(expr)") +
  coord_fixed() +
  facet_wrap(~ gene, ncol = 2) +
  labs(
    title = "Spatial expression (side-by-side)",
    subtitle = "Two separate maps often communicate best"
  ) +
  theme_minimal()

print(p4)

28 Jan 2026

« Kidney Oxygen Depletion Visualization HW1 »