Top 10 Must-Use Terms To Get Your Next NIH Grant Funded

Jun 11, 2018


As mentioned in my previous blog post, the NIH maintains an excellent database (Federal RePORTER) of federally funded grants from the past decade. This database includes nicely parsed Project Terms for every funded project.

While I have my own personal research interests (in computational methods development for precision medicine and cancer treatment), I wonder: what are funding agencies interested in? Are there certain hot topics I should consider looking into? Also, I’ve heard from others who have served on study committees that it is difficult to read all of these grants word for word. So are there certain words grant reviewers may look for that end up becoming really common in funded grants?

Thanks to this database, we can now take a data-driven approach to addressing our questions!

Read in the last 10 years of US federal grant data from Federal RePORTER

First, let’s read in all the data. You can download the data here: https://federalreporter.nih.gov/FileDownload

# Read in all data (FY2008 through FY2017)
data <- do.call(rbind, lapply(2017:2008, function(year) {
  read.csv(paste0('FedRePORTER_PRJ_C_FY', year, '.csv.gz'), stringsAsFactors = FALSE)
}))

For the purposes of this blog post, I will restrict to new R-series grants (research grants, as opposed to training or program and center grants), which always begin with ‘1R’ in their project numbers.

# Restrict to new R-series grants
vi <- grepl('^1R', data$PROJECT_NUMBER)
table(vi)
## vi
##  FALSE   TRUE 
## 888420  88365
data <- data[vi,]

Now, let’s parse out the project terms.

# Get project terms
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
terms <- lapply(data$PROJECT_TERMS, function(x) {
  sapply(strsplit(x, ';')[[1]], trim)
})
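As a quick sanity check, here is what that parsing does on a made-up PROJECT_TERMS string (not taken from the real data):

```r
# Made-up example of a PROJECT_TERMS string; real entries are semicolon-delimited like this
x <- " Testing ; novel ; mouse model "
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
unname(sapply(strsplit(x, ';')[[1]], trim))
## [1] "Testing"     "novel"       "mouse model"
```

(In recent versions of R, `trimws(x)` does the same job as the `trim` helper.)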

Question 1: What are the most common project terms?

# Most common terms
mct <- table(unlist(terms))
mct <- mct/nrow(data) # normalize
mct <- sort(mct, decreasing=TRUE)
length(mct)
## [1] 45227
head(mct, n=10)
## 
##                 Testing                    base             Development 
##               0.6626379               0.6594127               0.5905053 
##                   Goals                    Data                Research 
##               0.5819499               0.5762915               0.5411758 
##                   novel public health relevance                    Role 
##               0.5117863               0.4999717               0.4860069 
##                 Disease 
##               0.4772138

It looks like there are over 45k different terms represented. However, the most common terms are quite boring. Almost every grant talks about ‘Testing’, ‘Research’, ‘Goals’, ‘Data’, and ‘Disease’…as expected.

Question 2: What project terms are becoming more common?

Let’s instead look for terms that are becoming more common. We may interpret these as up-and-coming hot topics that we should keep an eye on.

# Frequency of terms by year
freq <- lapply(2008:2017, function(year){
  vi <- data$FY == year
  sort(table(unlist(terms[vi])), decreasing=TRUE)
})

# normalizing factor, grants per year
norm <- sapply(2008:2017, function(year){
    sum(data$FY == year)
})

# look at a union of the top 2000 most common terms per year
# just so we don't have to look through >45k terms
top.words <- unique(unlist(lapply(freq, function(x) names(x)[1:2000])))
length(top.words)
## [1] 2675
# Look at slope of trend
slopes <- sapply(top.words, function(word) {
  word.freq <- unlist(lapply(freq, function(x) {
    x[word]
  }))
  df <- data.frame(x = 2008:2017, y=word.freq)
  # plot(df, type='l', main=word)
  # get slope
  lm(y~x, df)$coefficients[[2]]
})
names(slopes) <- top.words
slopes <- sort(slopes, decreasing=TRUE)
head(slopes, n=10)
##                    novel               innovation       targeted treatment 
##                174.43636                145.75152                139.00000 
## transcriptome sequencing               Zika Virus                     Data 
##                126.20000                124.00000                106.18182 
##              Therapeutic              mouse model                    Human 
##                 92.40000                 87.66667                 87.38788 
##    CRISPR/Cas technology 
##                 87.00000

Let’s interactively visualize a few of these trends using highcharter!

words <- names(head(slopes, n=10))
df <- do.call(cbind, lapply(words, function(word) {
  word.freq <- unlist(lapply(freq, function(x) {
    x[word]
  }))
  word.freq/norm
}))
rownames(df) <- 2008:2017
colnames(df) <- words

library(reshape2)
dfm <- melt(df)

# Adapted from highcharter::export_hc() to not write to file
library(jsonlite)
library(stringr)
write_hc <- function(hc, name) {
    JS_to_json <- function(x) {
        class(x) <- "json"
        return(x)
    }
    hc$x$hc_opts <- rapply(object = hc$x$hc_opts, f = JS_to_json, 
        classes = "JS_EVAL", how = "replace")
    js <- toJSON(x = hc$x$hc_opts, pretty = TRUE, auto_unbox = TRUE, 
        json_verbatim = TRUE, force = TRUE, null = "null", na = "null")
    js <- sprintf("$(function(){\n\t$('#%s').highcharts(\n%s\n);\n});", 
            name, js)
    return(js)
}
# Helper function to write javascript in Rmd
# Thanks to http://livefreeordichotomize.com/2017/01/24/custom-javascript-visualizations-in-rmarkdown/
send_hc_to_js <- function(hc, hcid){
  cat(
    paste(
      '<span id=\"', hcid, '\"></span>',
      '<script>',
      write_hc(hc, hcid),
      '</script>', 
      sep="")
  )
}


library(highcharter)
hc1 <- highchart() %>% 
  hc_add_series(dfm, "line", hcaes(x = Var1, y = value, group = Var2)) %>%
  hc_title(text = 'Frequency of Top 10 Up-and-Coming Project Terms in New Funded R-series Grants') %>%
  hc_legend(align = "right", layout = "vertical")
send_hc_to_js(hc1, 'hc1')

First thought: Wow! In 2008, only 40% of new funded NIH R-series grants had ‘novel’ as a Project Term and now nearly 57% do. So if you want to get your grant funded, throw a few mentions of ‘novel’ in there ;)

Second thought: some of the less frequent terms are the most interesting. In 2013, only 2% of funded NIH R-series grants had ‘transcriptome sequencing’ as a Project Term, but this has steadily increased to now over 6%! Similarly, ‘CRISPR/Cas technology’ increased from 2.6% to 3.6% from 2016 to 2017. Most interesting, in my opinion, is the rise in Zika Virus-related projects following the 2015–2016 Zika epidemic.

Question 3: Is my research field fundable?

I’m going to look through a few terms related to my own work.

words <- c('Computer software', 'Statistical Methods', 'open source', 'single cell technology', 'tumor microenvironment', 'precision medicine')
df <- do.call(cbind, lapply(words, function(word) {
  word.freq <- unlist(lapply(freq, function(x) {
    x[word]
  }))
  word.freq/norm
}))
rownames(df) <- 2008:2017
colnames(df) <- words

library(reshape2)
dfm <- melt(df)

library(highcharter)
hc2 <- highchart() %>% 
  hc_add_series(dfm, "line", hcaes(x = Var1, y = value, group = Var2)) %>%
  hc_title(text = 'Frequency of Jean\'s Project Terms in New Funded R-series Grants') %>%
  hc_legend(align = "right", layout = "vertical")
send_hc_to_js(hc2, 'hc2')

Unfortunately, looking at the y-axis, fairly few grants with computational and statistical project terms get funded each year, and there does not seem to be an increasing trend. This is not to say that a grant with computational and statistical aims can’t get funded. But perhaps the primary emphasis of your grant should be on the precision-medicine-related biological applications rather than the open-source nature of your project ;)

Other observations

words <- c("follow-up", "Surveys", "Questionnaires", "Small Interfering RNA", "Mental Health", "depressive symptoms", "Drug abuse", "Substance abuse problem")
df <- do.call(cbind, lapply(words, function(word) {
  word.freq <- unlist(lapply(freq, function(x) {
    x[word]
  }))
  word.freq/norm
}))
rownames(df) <- 2008:2017
colnames(df) <- words

library(reshape2)
dfm <- melt(df)

library(highcharter)
hc3 <- highchart() %>% 
  hc_add_series(dfm, "line", hcaes(x = Var1, y = value, group = Var2)) %>%
  hc_title(text = 'Frequency of Select Project Terms in New Funded R-series Grants') %>%
  hc_legend(align = "right", layout = "vertical")
send_hc_to_js(hc3, 'hc3')

Projects related to follow-up studies seem to be declining slightly. This is consistent with the increasing importance of innovation and novelty. Similarly, projects related to surveys and questionnaires are going down. This is consistent with the increasing emphasis on mouse models and molecular mechanisms. Likewise, projects related to siRNAs are going down, consistent with the increasing preference for CRISPR/Cas.

Unfortunately, unlike Zika, we are not seeing a substantial increase in the proportion of funded projects related to mental health or drug abuse despite recent epidemics in these areas.

Discussion

In conclusion, if you want your next NIH R-series grant to get funded, all you have to do is write a project about a ‘novel’ ‘CRISPR/Cas’ ‘mouse model’ to discover ‘targeted treatment’ ‘Therapeutics’ for the ‘Zika Virus’ in ‘Humans’! Just kidding. But hopefully the availability of this kind of data can at least help guide our grant writing with more informed word choices and research topics.

Keep in mind that this analysis is by no means exhaustive. Certain terms and topics may be represented differently across different years, confounding trends. It is also unclear how these Project Terms are auto-generated for each grant and if biases may be introduced during that stage.
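To make the variant concern concrete, here is a minimal sketch of collapsing spelling variants to a canonical term before counting; the variant names here are hypothetical, not drawn from the actual Project Terms vocabulary:

```r
# Hypothetical lookup table mapping spelling variants to one canonical term,
# so variants don't dilute each other's counts when tabulated
variants <- c('Zika Virus' = 'zika virus', 'ZIKV' = 'zika virus')
canonicalize <- function(term) {
  ifelse(term %in% names(variants), unname(variants[term]), term)
}
canonicalize(c('Zika Virus', 'ZIKV', 'mouse model'))
## [1] "zika virus"  "zika virus"  "mouse model"
```

With a real variant table, running `table(unlist(lapply(terms, canonicalize)))` in place of the earlier tabulation would count each concept once per spelling family.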

Other potentially interesting questions:

  • Are certain topics more common in R1 vs. R2 vs. R3 institutions?
  • Which schools are getting lots of grants in your research area of interest?
  • Can we integrate natural language processing and topic modeling to see if certain groups of terms (rather than individual terms) are enriched in the most commonly funded grants or are super hot and increasing in frequency over time?
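For the last question, a natural first step is a grant-by-term incidence matrix, from which term co-occurrence counts (or input to a topic model) fall out directly. A toy sketch with made-up grants rather than the real data:

```r
# Toy version of the parsed terms list: each element is one grant's Project Terms
toy.terms <- list(
  c('novel', 'mouse model', 'CRISPR/Cas technology'),
  c('novel', 'Zika Virus'),
  c('mouse model', 'CRISPR/Cas technology')
)
# Binary grant-by-term incidence matrix
all.terms <- sort(unique(unlist(toy.terms)))
mat <- t(sapply(toy.terms, function(tt) as.numeric(all.terms %in% tt)))
colnames(mat) <- all.terms
# Off-diagonal entries count how often two terms appear in the same grant
cooc <- t(mat) %*% mat
cooc['mouse model', 'CRISPR/Cas technology']
## [1] 2
```

The same incidence matrix, built from the real `terms` list, could then be fed to a topic-modeling package (e.g., LDA from the `topicmodels` CRAN package, mentioned here only as one possible tool) to look for enriched groups of terms.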