Top 10 Must-Use Terms to Get Your Next NIH Grant Funded
As mentioned in my previous blog post, the NIH has an excellent database of all the federally funded grants in the past decade. This database includes nicely parsed Project Terms related to every funded project.
While I have my own personal research interests (in computational methods development for precision medicine and cancer treatment), I wonder: what are funding agencies interested in? Are there certain hot topics I should consider looking into? Also, I’ve heard from others who have served on study committees that it is difficult to read all of these grants word for word. So are there certain words grant reviewers may look for that end up becoming really common in funded grants?
Thanks to this database, we can now take a data driven approach to addressing our questions!
Read in the last 10 years of US federal grant data from Federal RePORTER
First, let’s read in all the data. You can download the data here: https://federalreporter.nih.gov/FileDownload
# Read in all data
data <- rbind(
read.csv('FedRePORTER_PRJ_C_FY2017.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2016.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2015.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2014.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2013.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2012.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2011.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2010.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2009.csv.gz', stringsAsFactors = FALSE),
read.csv('FedRePORTER_PRJ_C_FY2008.csv.gz', stringsAsFactors = FALSE)
)
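The ten read.csv() calls differ only in the fiscal year, so they could also be generated in a loop. Here is a minimal sketch, assuming all files follow the FedRePORTER_PRJ_C_FY<year>.csv.gz naming pattern used above and sit in one directory. The helper name read_fiscal_years and its dir argument are hypothetical, and the demo writes two tiny stand-in files to a temporary directory instead of using the real downloads.

```r
# Hypothetical helper: read one file per fiscal year and stack the rows
read_fiscal_years <- function(years, dir = ".") {
  files <- file.path(dir, sprintf("FedRePORTER_PRJ_C_FY%d.csv.gz", years))
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}

# Demo with two tiny stand-in files (made up, not real Federal RePORTER data)
demo_dir <- tempfile("fedreporter_demo")
dir.create(demo_dir)
for (year in 2016:2017) {
  path <- file.path(demo_dir, sprintf("FedRePORTER_PRJ_C_FY%d.csv.gz", year))
  con <- gzfile(path, "w")
  write.csv(data.frame(PROJECT_NUMBER = "1R01XX000000-01", FY = year),
            con, row.names = FALSE)
  close(con)
}

demo <- read_fiscal_years(2017:2016, demo_dir)
print(demo)
```

With the real downloads in the working directory, the equivalent call would be read_fiscal_years(2017:2008).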
For the purposes of this blog post, I will restrict the analysis to new R-series grants (research grants, as opposed to training or program and center grants), whose project numbers always start with 1R.
# Restrict to new R-series grants
vi <- grepl('^1R', data$PROJECT_NUMBER)
table(vi)
## vi
## FALSE TRUE
## 888420 88365
data <- data[vi,]
Now, let’s parse out the project terms.
# Get project terms
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
terms <- lapply(data$PROJECT_TERMS, function(x) {
sapply(strsplit(x, ';')[[1]], trim)
})
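To make the parsing step concrete, here is a small worked example on a single made-up PROJECT_TERMS string. It uses base R's trimws(), which does the same job as the trim() helper above; the string itself is invented for illustration.

```r
# Made-up PROJECT_TERMS value with inconsistent spacing around the separators
example_terms <- "Testing; base ;novel;  public health relevance"

# Split on ';' and strip leading/trailing whitespace from each term
parsed <- trimws(strsplit(example_terms, ";", fixed = TRUE)[[1]])
print(parsed)
```

Each grant ends up as a character vector of terms, which is what table() counts in the next step.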
Question 1: What are the most common project terms?
# Most common terms
mct <- table(unlist(terms))
mct <- mct/nrow(data) # normalize
mct <- sort(mct, decreasing=TRUE)
length(mct)
## [1] 45227
head(mct, n=10)
##
## Testing base Development
## 0.6626379 0.6594127 0.5905053
## Goals Data Research
## 0.5819499 0.5762915 0.5411758
## novel public health relevance Role
## 0.5117863 0.4999717 0.4860069
## Disease
## 0.4772138
It looks like there are over 45k different terms represented. However, the most common terms are quite boring. More than half of these grants mention ‘Testing’, ‘Research’, ‘Goals’, and ‘Data’, and nearly half mention ‘Disease’…as expected.
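One caveat on the normalization: dividing the total count of a term by the number of grants only equals "the share of grants with that term" if no term is listed twice for the same grant. I have not checked whether that ever happens in this database, so the sketch below is a hedge rather than a correction. It uses made-up terms to show how deduplicating within each grant changes the result when duplicates do occur.

```r
# Made-up parsed terms for three grants; the first lists 'novel' twice
toy_terms <- list(
  c("novel", "Data", "novel"),
  c("Data", "Disease"),
  c("novel")
)

# Total mentions per grant (what table(unlist(...)) / n computes)
raw_share <- table(unlist(toy_terms)) / length(toy_terms)

# Share of grants mentioning the term at least once
doc_share <- table(unlist(lapply(toy_terms, unique))) / length(toy_terms)

print(raw_share)
print(doc_share)
```

If the two tables agree on the real data, the simpler version is fine.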
Question 2: What project terms are becoming more common?
Let’s instead look for terms that are becoming more common. We may interpret these as up-and-coming hot topics that we should keep an eye on.
# Frequency of terms by year
freq <- lapply(2008:2017, function(year){
vi <- data$FY == year
universe <- table(unlist(terms[vi]))
universe <- sort(universe, decreasing=TRUE)
})
# normalizing factor, grants per year
norm <- sapply(2008:2017, function(year){
sum(data$FY == year)
})
# look at a union of the top 2000 most common terms per year
# just so we don't have to look through >45k terms
top.words <- unique(unlist(lapply(freq, function(x) names(x)[1:2000])))
length(top.words)
## [1] 2675
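# Look at slope of trend
# (fit on raw yearly counts, so the units are grants per year)
slopes <- sapply(top.words, function(word) {
word.freq <- unlist(lapply(freq, function(x) {
x[word] # NA if the term does not appear in that year
}))
df <- data.frame(x = 2008:2017, y=word.freq)
# plot(df, type='l', main=word)
# get slope; lm() drops years where y is NA
lm(y~x, df)$coefficients[[2]]
})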
names(slopes) <- top.words
slopes <- sort(slopes, decreasing=TRUE)
head(slopes, n=10)
## novel innovation targeted treatment
## 174.43636 145.75152 139.00000
## transcriptome sequencing Zika Virus Data
## 126.20000 124.00000 106.18182
## Therapeutic mouse model Human
## 92.40000 87.66667 87.38788
## CRISPR/Cas technology
## 87.00000
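Note that the slopes above are fit on raw yearly counts, so a term can rank highly partly because the total number of new grants changed over the decade. Here is a minimal sketch of the same regression on proportions instead, using made-up numbers in which the term's count grows only because the total grows. The helper name slope_of is hypothetical.

```r
years  <- 2008:2017
totals <- seq(100, 190, by = 10)   # made-up number of new grants per year
counts <- 0.4 * totals             # made-up term counts: always 40% of grants

# Slope of a simple linear fit of y on x
slope_of <- function(y, x) unname(coef(lm(y ~ x))[2])

count_slope <- slope_of(counts, years)           # rises with the totals
share_slope <- slope_of(counts / totals, years)  # flat, the share never changes

print(c(count = count_slope, share = share_slope))
```

Ranking by the slope of word.freq/norm would give a trend measure in the same units as the charts below.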
Let’s interactively visualize a few of these trends using highcharter!
words <- names(head(slopes, n=10))
df <- do.call(cbind, lapply(words, function(word) {
word.freq <- unlist(lapply(freq, function(x) {
x[word]
}))
word.freq/norm
}))
rownames(df) <- 2008:2017
colnames(df) <- words
library(reshape2)
dfm <- melt(df)
# Adapted from highcharter::export_hc() to not write to file
library(jsonlite)
library(stringr)
write_hc <- function(hc, name) {
JS_to_json <- function(x) {
class(x) <- "json"
return(x)
}
hc$x$hc_opts <- rapply(object = hc$x$hc_opts, f = JS_to_json,
classes = "JS_EVAL", how = "replace")
js <- toJSON(x = hc$x$hc_opts, pretty = TRUE, auto_unbox = TRUE,
json_verbatim = TRUE, force = TRUE, null = "null", na = "null")
js <- sprintf("$(function(){\n\t$('#%s').highcharts(\n%s\n);\n});",
name, js)
return(js)
}
# Helper function to write javascript in Rmd
# Thanks to http://livefreeordichotomize.com/2017/01/24/custom-javascript-visualizations-in-rmarkdown/
send_hc_to_js <- function(hc, hcid){
cat(
paste(
'<span id=\"', hcid, '\"></span>',
'<script>',
write_hc(hc, hcid),
'</script>',
sep="")
)
}
library(highcharter)
hc1 <- highchart() %>%
hc_add_series(dfm, "line", hcaes(x = Var1, y = value, group = Var2)) %>%
hc_title(text = 'Frequency of Top 10 Up-and-Coming Project Terms in New Funded R-series Grants') %>%
hc_legend(align = "right", layout = "vertical")
send_hc_to_js(hc1, 'hc1')
First thought: Wow! In 2008, only 40% of new funded NIH R-series grants had ‘novel’ as a Project Term and now nearly 57% do. So if you want to get your grant funded, throw a few mentions of ‘novel’ in there ;)
Second thought: some of the less frequent terms are the most interesting. In 2013, only 2% of funded NIH R-series grants had ‘transcriptome sequencing’ as a Project Term, but this has steadily increased to over 6% now! Similarly, CRISPR/Cas technology increased from 2.6% to 3.6% from 2016 to 2017. Most interesting, in my opinion, is the rise in Zika Virus related projects following the 2015-2016 Zika epidemic.
Question 3: Is my research field fundable?
I’m going to look through a few terms related to my own work.
words <- c('Computer software', 'Statistical Methods', 'open source', 'single cell technology', 'tumor microenvironment', 'precision medicine')
df <- do.call(cbind, lapply(words, function(word) {
word.freq <- unlist(lapply(freq, function(x) {
x[word]
}))
word.freq/norm
}))
rownames(df) <- 2008:2017
colnames(df) <- words
library(reshape2)
dfm <- melt(df)
library(highcharter)
hc2 <- highchart() %>%
hc_add_series(dfm, "line", hcaes(x = Var1, y = value, group = Var2)) %>%
hc_title(text = 'Frequency of Jean\'s Project Terms in New Funded R-series Grants') %>%
hc_legend(align = "right", layout = "vertical")
send_hc_to_js(hc2, 'hc2')
Unfortunately, looking at the y-axis, fairly few grants with computational and statistical project terms get funded each year, and there does not seem to be an increasing trend. This is not to say that a grant with computational and statistical aims can’t get funded. But perhaps the primary emphasis of your grant should be on the precision-medicine-related biological applications rather than the open-source nature of your project ;)
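The block that turns a vector of words into a year-by-term matrix is repeated for every chart, so it could be wrapped in a small function. This is a sketch only: term_share is a hypothetical name, and unlike the code above it records a term that is absent in a given year as 0 rather than NA, which is a design choice on my part. The demo uses made-up frequency tables.

```r
# Hypothetical helper: share of grants per year carrying each term
# freq: list of per-year term count tables; norm: grants per year
term_share <- function(words, freq, norm, years) {
  m <- sapply(words, function(word) {
    counts <- sapply(freq, function(x) {
      v <- as.numeric(x[word])
      if (is.na(v)) 0 else v   # term absent that year: count as 0
    })
    counts / norm
  })
  rownames(m) <- years
  m
}

# Made-up inputs for two years
toy_freq <- list(
  table(c("open source", "precision medicine", "precision medicine")),
  table(c("precision medicine"))
)
toy_norm <- c(4, 2)

shares <- term_share(c("open source", "precision medicine"),
                     toy_freq, toy_norm, 2016:2017)
print(shares)
```

The result has years as row names and terms as column names, so it could be passed to melt() in the same way as df above.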
Other observations
words <- c("follow-up", "Surveys", "Questionnaires", "Small Interfering RNA", "Mental Health", "depressive symptoms", "Drug abuse", "Substance abuse problem")
df <- do.call(cbind, lapply(words, function(word) {
word.freq <- unlist(lapply(freq, function(x) {
x[word]
}))
word.freq/norm
}))
rownames(df) <- 2008:2017
colnames(df) <- words
library(reshape2)
dfm <- melt(df)
library(highcharter)
hc3 <- highchart() %>%
hc_add_series(dfm, "line", hcaes(x = Var1, y = value, group = Var2)) %>%
hc_title(text = 'Frequency of Select Project Terms in New Funded R-series Grants') %>%
hc_legend(align = "right", layout = "vertical")
send_hc_to_js(hc3, 'hc3')
Projects related to follow-up studies seem to be going down slightly. This is consistent with the increasing importance of innovation and novelty. Similarly, projects related to surveys and questionnaires are going down. This is consistent with the increasing emphasis on mouse models and molecular mechanisms. Likewise, projects related to siRNAs are going down, consistent with the increasing preference for CRISPR/Cas.
Unfortunately, unlike Zika, we are not seeing a substantial increase in the proportion of funded projects related to mental health or drug abuse despite recent epidemics in these areas.
Discussion
So in conclusion, if you want your next NIH R-series grant to get funded, all you have to do is write a project about a ‘novel’ ‘CRISPR/Cas’ ‘mouse model’ to discover ‘targeted treatment’ ‘Therapeutic’s for the ‘Zika Virus’ in ‘Human’s! Just kidding. But hopefully the availability of this kind of data can at least help guide our grant writing with more informed word choices and research topics.
Keep in mind that this analysis is by no means exhaustive. Certain terms and topics may be represented differently across different years, confounding trends. It is also unclear how these Project Terms are auto-generated for each grant and if biases may be introduced during that stage.
Other potentially interesting questions:
- Are certain topics more common in R1 vs. R2 vs. R3 institutions?
- Which schools are getting lots of grants in your research area of interest?
- Can we integrate natural language processing and topic modeling to see if certain groups of terms (rather than individual terms) are enriched in the most commonly funded grants or are super hot and increasing in frequency over time?