Data Driven Faculty Job Search

Jun 7, 2018


For many post-docs, the next step will be to apply for a faculty position. But which schools should we look into? Let’s take a data-driven approach to our job search!

Let’s say I am looking for a faculty position in the United States. I am primarily interested in doing research and writing grants to get funding to do this research. One data-driven approach to help narrow down the list of schools I may consider is to look for institutions that have been successful in getting grants.

Luckily, the NIH has an excellent database of all the federally funded grants in the past decade! They even have an API! But the data is actually small enough that I just downloaded everything within a few minutes. So let’s read it in and start parsing!

Read in the last 10 years of US federal grant data from Federal RePORTER

Download the data here: https://federalreporter.nih.gov/FileDownload

# Stack the yearly project files (FY2008 through FY2017) into one data frame
data <- rbind(
	read.csv('FedRePORTER_PRJ_C_FY2017.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2016.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2015.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2014.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2013.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2012.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2011.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2010.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2009.csv.gz', stringsAsFactors = FALSE),
	read.csv('FedRePORTER_PRJ_C_FY2008.csv.gz', stringsAsFactors = FALSE)
)
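
The same load can be written more compactly by generating the file names from a vector of fiscal years, which also makes it easy to spot a year that was skipped or never downloaded. This is only a sketch: it assumes every file follows the FedRePORTER_PRJ_C_FY<year>.csv.gz pattern used above, sits in the working directory, and shares the same columns. The names years, files, missing, and available are purely illustrative.

```r
# Fiscal years to load (assumed range: the ten years ending in FY2017)
years <- 2008:2017

# Build file names following the pattern of the downloaded files
files <- sprintf('FedRePORTER_PRJ_C_FY%d.csv.gz', years)

# Check which files are actually present before reading anything
missing <- files[!file.exists(files)]
if (length(missing) > 0) {
  message('Not found: ', paste(missing, collapse = ', '))
}

# Read whatever is available and stack it into one data frame
available <- setdiff(files, missing)
if (length(available) > 0) {
  data <- do.call(rbind, lapply(available, read.csv, stringsAsFactors = FALSE))
}
```

Printing the missing files up front is a cheap guard against silently analyzing fewer years than intended.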

Of course, there are some known issues with this data in terms of consistency of organization names, lack of information from private foundation grants, yada yada. Certain institutions like Harvard vs. Harvard Medical vs. Brigham and Women’s Hospital vs. Children’s Hospital vs. MGH vs. Dana Farber, etc. are all broken up, and therefore any statistics computed on an individual institution may not be an accurate reflection of the broader network’s capacity to win grants. But it’s what we have. So we will work with it.

Question 1: Do R1 institutions really get more grants than R2 and R3 institutions?

What institutions should you look into? Well, if you are interested in writing grants and doing research, most people will tell you: an R1 institution (R1 is a category that the Carnegie Classification of Institutions of Higher Education uses to indicate universities in the United States that engage in extensive research activity). But is it really true that R1 institutions do ‘better’ funding-wise than R2 institutions? Let’s let the data speak for itself!

I got my list of institution names from Wikipedia: https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States

R1uni.info <- read.csv('R1_Institutions.csv', stringsAsFactors = FALSE)
R1uni <- toupper(R1uni.info$Institution)
R2uni.info <- read.csv('R2_Institutions.csv', stringsAsFactors = FALSE)
R2uni <- toupper(R2uni.info$Institution)
R3uni.info <- read.csv('R3_Institutions.csv', stringsAsFactors = FALSE)
R3uni <- toupper(R3uni.info$Institution)

Note the Wiki institution names don’t match up perfectly with the database organization names.

# Clean up organization names (still some errors but this captures most of them)
# Only strip a leading 'THE ' so names like 'UNIVERSITY OF THE ...' stay intact
data$ORGANIZATION_NAME <- sub('^THE ', '', data$ORGANIZATION_NAME)

numGrants <- table(data$ORGANIZATION_NAME)
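
To see how much the name mismatch matters, it helps to list which names from the Wikipedia list have no exact counterpart in the database. The sketch below uses made-up toy vectors rather than the real data; wiki_names and db_names are hypothetical stand-ins for R1uni and data$ORGANIZATION_NAME.

```r
# Toy stand-ins (not real data) for the Wikipedia list and the database column
wiki_names <- toupper(c('Alpha University', 'Beta State University', 'Gamma Institute'))
db_names <- c('ALPHA UNIVERSITY', 'THE BETA STATE UNIVERSITY', 'GAMMA INST', 'ALPHA UNIVERSITY')

# Drop a leading 'THE ' (anchored with ^ so 'THE ' inside a name is left alone)
db_clean <- sub('^THE ', '', db_names)

# Count grants per organization, as numGrants does for the real data
counts <- table(db_clean)

# Names from the Wikipedia list with no exact match in the database
unmatched <- setdiff(wiki_names, names(counts))
unmatched
```

Any institution that lands in unmatched is silently absent from the plots that follow, so the list is worth a quick look before trusting the rankings.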

We can use highcharter to create interactive visualizations.

d1 <- sort(numGrants[R1uni], decreasing=TRUE)
df1 <- data.frame(
	'uni' = names(d1),
	'grants' = as.numeric(d1)
)
d2 <- sort(numGrants[R2uni], decreasing=TRUE)
df2 <- data.frame(
	'uni' = names(d2),
	'grants' = as.numeric(d2)
)
d3 <- sort(numGrants[R3uni], decreasing=TRUE)
df3 <- data.frame(
	'uni' = names(d3),
	'grants' = as.numeric(d3)
)
			
library(highcharter)
highchart() %>% 
	hc_add_series(df1, "column", hcaes(x = uni, y = grants), name='R1') %>%
	hc_add_series(df2, "column", hcaes(x = uni, y = grants), name='R2') %>%
	hc_add_series(df3, "column", hcaes(x = uni, y = grants), name='R3') %>%
	hc_title(text = "Number of Grants per Institution") 

So as we can see, in general, R1 institutions do get more grants. However, there is a wide distribution! And there are definitely R2 institutions with more grants than a lot of R1 institutions! Of course, this is just looking at the number of grants per institution without normalizing for the number of professors.
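
One rough way to put numbers on "in general, but with overlap" is to compare medians and check how many R2 institutions clear the lower quartile of R1 institutions. The counts below are made up for illustration; r1_counts and r2_counts are hypothetical stand-ins for d1 and d2.

```r
# Toy grant counts (made-up numbers, standing in for d1 and d2)
r1_counts <- c(9000, 5000, 3000, 1200, 400)
r2_counts <- c(2500, 800, 300, 100)

# Medians summarize the "in general" comparison
median(r1_counts)
median(r2_counts)

# Share of R2 institutions with more grants than the lowest quarter of R1 institutions
r1_q25 <- quantile(r1_counts, 0.25, names = FALSE)
share_r2_above <- mean(r2_counts > r1_q25)
share_r2_above
```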

For the time being, let’s restrict our remaining analysis to the R1 or R2 institutions with more than 2000 grants.

uni <- na.omit(c(R1uni[numGrants[R1uni]>2000], R2uni[numGrants[R2uni]>2000]))

Question 2: How many research grants does each professor have on average?

Does 1 person have a lot of grants, or are they well distributed across many people? To be more accurate, let’s restrict to new R-series grants (research grants as opposed to training or program and center grants), whose project numbers always start with 1R. Basically, we want to know, within the past decade, how many of these R-series grants a ‘typical’ professor gets at each institution so we may better gauge our expectations and their expectations for us.
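
For NIH awards, the project number begins with a one-digit application type (1 marks a new award) followed by the activity code (R01, R21, and so on), which is why the pattern ^1R picks out new R-series grants. Here is a minimal illustration with made-up project numbers. Federal RePORTER also includes awards from other agencies whose numbering may differ, so it is an assumption worth checking that the pattern behaves as intended on the real PROJECT_NUMBER column.

```r
# Made-up project numbers in the NIH style: <type><activity code><institute><serial>-<year>
project_numbers <- c(
  '1R01CA000001-01',  # new R01: matches
  '5R01CA000001-03',  # continuation (type 5) of an R01: no match
  '1R21GM000002-01',  # new R21: matches
  '1F32HL000003-01',  # new fellowship (F-series): no match
  '2R01AI000004-06'   # competing renewal (type 2): no match
)

# TRUE only where the number starts with application type 1 and an R activity code
is_new_r <- grepl('^1R', project_numbers)
project_numbers[is_new_r]
```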

npi <- lapply(uni, function(school) {
  # limit to R-series grants
  vi1 <- grepl('^1R', data$PROJECT_NUMBER)
  # limit to school
  vi2 <- data$ORGANIZATION_NAME==school
  # get primary PI name
  npi <- na.omit(data[vi1&vi2,]$CONTACT_PI_PROJECT_LEADER)
  # compute how often name occurs
  table(npi)
})
names(npi) <- uni
d <- sort(sapply(npi, mean), decreasing=TRUE)
df <- data.frame(
  'uni' = names(d),
  'grants' = as.numeric(d)
)

library(highcharter)
highchart() %>% 
  hc_add_series(df, "column", hcaes(x = uni, y = grants)) %>%
  hc_title(text = "Average number of New R-series Grants per Professor per Institution")

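The distribution doesn’t look as wide as I would’ve expected! It seems like at almost all institutions, professors who won at least one new R-series grant got on average 1 to 2 of them in the past decade. (PIs with no such grants never show up in the table, so the average can’t go below 1.)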
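
A toy example makes it concrete what the per-school table and its mean are measuring. The PI names below are invented; each element stands for one new R-series grant at a single school.

```r
# Toy PI names for one school (made up): one entry per new R-series grant
pi_names <- c('SMITH, A', 'SMITH, A', 'LEE, B', 'GARCIA, C', 'SMITH, A', 'LEE, B')

# Number of grants per PI, as table(npi) computes inside the lapply
grants_per_pi <- table(pi_names)
grants_per_pi

# Average and maximum are taken only over PIs who appear at least once
avg_grants <- mean(grants_per_pi)  # 6 grants / 3 PIs = 2
max_grants <- max(grants_per_pi)   # SMITH, A has 3
```

Because a professor with zero grants contributes no rows, this is an average among funded PIs, not among all faculty.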

Just out of curiosity, what is the maximum number of new R-series grants a professor has gotten within the past decade at an institution? It’ll be left as an exercise to the reader to stalk these superstars and learn their secrets.

d <- sort(sapply(npi, max), decreasing=TRUE)
df <- data.frame(
  'uni' = names(d),
  'grants' = as.numeric(d)
)

library(highcharter)
highchart() %>% 
  hc_add_series(df, "column", hcaes(x = uni, y = grants)) %>%
  hc_title(text = "Max number of New R-series Grants per Professor per Institution")

Question 3: How often do PIs write grants together?

Research often takes teamwork! We can gauge the collaborativeness of an institution based on how frequently its professors write grants together. So let’s count how frequently there is more than 1 PI on a single grant.

nopis <- unlist(lapply(uni, function(school) {
  vi <- data$ORGANIZATION_NAME==school
  opis <- data[vi,]$OTHER_PIS
  sum(opis!='')/length(opis) # count how often other PIs is not empty
}))
names(nopis) <- uni

d <- sort(nopis, decreasing=TRUE)
df <- data.frame(
  'uni' = names(d),
  'collabs' = as.numeric(d)
)

library(highcharter)
highchart() %>% 
  hc_add_series(df, "column", hcaes(x = uni, y = collabs)) %>%
  hc_title(text = "Fraction of grants with >1 PI per Institution")
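
The quantity being plotted is a value between 0 and 1, and it turns into NA if any OTHER_PIS entry is missing. A small sketch on made-up values shows a version that tolerates missing entries and converts to a percentage; the semicolon-separated names are an assumed format for illustration only.

```r
# Toy OTHER_PIS column (made up): empty string means no additional PIs
other_pis <- c('', 'LEE, B;', '', 'GARCIA, C; SMITH, A;', NA, '')

# Treat missing values as "no other PI" so they do not turn the result into NA
has_other <- !is.na(other_pis) & other_pis != ''

# Fraction of grants with more than one PI, and the same value as a percentage
frac_multi <- mean(has_other)
pct_multi <- 100 * frac_multi
pct_multi
```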

Of course, there are other ways to collaborate without necessarily being on each other’s grants as co-PIs.

Anyways, could this data-driven approach help us focus on some schools that maybe weren’t originally in our consideration? Seems like it could be a useful exploratory analysis! More to come!

Some additional fun questions left to the reader as an exercise

  • What is the distribution of funding amounts at each institution based on FY_TOTAL_COST?
  • Does an institution’s total funding amount or number of grants increase with time or is it relatively stable?
  • What research topics are the most popular among funded R-series grants based on PROJECT_TERMS? Is there any enrichment? Similarly, have any funded research topics increased in frequency over the past decade?
  • Can we integrate additional datasets and connect PIs to departments to calculate department-level statistics?
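
As a possible starting point for the first question, per-institution summaries of FY_TOTAL_COST can be computed with tapply. This is a sketch on a made-up data frame named toy, not the real data, and it assumes FY_TOTAL_COST is numeric with some missing values.

```r
# Toy stand-in for `data` (made-up values); some costs are missing
toy <- data.frame(
  ORGANIZATION_NAME = c('ALPHA UNIVERSITY', 'ALPHA UNIVERSITY', 'ALPHA UNIVERSITY',
                        'BETA STATE UNIVERSITY', 'BETA STATE UNIVERSITY'),
  FY_TOTAL_COST = c(100000, 300000, NA, 250000, 50000),
  stringsAsFactors = FALSE
)

# Median award size per institution, ignoring missing costs
median_cost <- tapply(toy$FY_TOTAL_COST, toy$ORGANIZATION_NAME, median, na.rm = TRUE)
median_cost

# Total funding per institution, ignoring missing costs
total_cost <- tapply(toy$FY_TOTAL_COST, toy$ORGANIZATION_NAME, sum, na.rm = TRUE)
total_cost
```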