1. Machine Learning Feature Selection For Diffexp Analysis

    Feature Selection for Differential Expression Analysis and Marker Selection. Introduction: In transcriptomics analysis, we are often interested in identifying differentially expressed genes. For example, if we have two conditions or cell types, we may be interested in which genes are significantly upregulated in condition A vs. B or cell type X vs. Y, and vice versa. Differential expression analysis is often performed using Wilcoxon tests, t-tests, or other similar tests for differences in distribution. However,...


  2. 5 Useful Bash Aliases And Functions

    5 Useful Bash Aliases and Functions For Lazy Bioinformaticians. Continuing our theme of making bioinformatics more sexy with BuzzFeed-esque blog post titles, here are 5 useful bash aliases and functions so you can remember fewer non-intuitive options, type fewer keys for the same output, and overall be more productive and efficient in your bioinformatics analysis :D i.e., have more time to look at dank memes. I’ll try to keep to aliases and functions that...


  3. A Practical Introduction To Finite Mixture Models

    A practical introduction to finite mixture modeling with flexmix in R. Introduction: Finite mixture models are very useful when applied to data where observations originate from various groups and the group affiliations are not known. For example, in single cell RNA-seq data, transcripts in each cell can be modeled as a mixture of two probabilistic processes: 1) a negative binomial process for when a transcript is amplified and detected at a level correlating with its...


  4. 5 Must Dos For Efficient Bioinformatics

    5 must-dos for efficient bioinformatics. My colleague Kamil was joking about how we need to make bioinformatics sexier and more click-bait-y with those ridiculous BuzzFeed-esque headlines like ‘N ways to X your Y’ and ‘M best Ws that will K your J’. So here are my 5 must-dos for efficient bioinformatics that I’ve tried to get all my students to adopt. Get a text editor that can send commands to the terminal: The biggest efficiency...


  5. Custom Sequencing Library Bioinformatics

    Bioinformatics for custom sequencing library constructions. Sequencing has become so streamlined that we often just use standard library prep kits, made for particular sequencers, followed by proprietary bioinformatics software for demultiplexing and quantification. But if we want to design a custom library structure, perhaps for use in multiplexed droplet-based approaches, we will need to come up with our own bioinformatics pipelines. In this tutorial, I will take you through a recent experience I had analyzing reads...


  6. Multiclass Differential Expression Analysis

    Multi-class / Multi-group Differential Expression Analysis. Introduction: In transcriptomics analysis, we are often interested in identifying differentially expressed genes. For example, if we have two conditions or cell types, we may be interested in which genes are significantly upregulated in condition A vs. B or cell type X vs. Y, and vice versa. However, what happens when you have multiple conditions or many cell types? In this tutorial, I will use simulated data to demonstrate...


  7. How To Be A Research Parasite

    How to be a “research parasite”: a guide to analyzing public sequencing data from GEO. In this tutorial, I will take you through my workflow for obtaining public sequencing data available on NCBI GEO. Let’s say, for example, I am interested in analyzing the single cell RNA-seq data found in this paper: Single-Cell Analysis Reveals a Close Relationship between Differentiating Dopamine and Subthalamic Nucleus Neuronal Lineages. In the paper, the authors note that “The accession...


  8. Cigar Strings For Dummies

    Smoke and CIGAR (strings). The ‘CIGAR’ (Compact Idiosyncratic Gapped Alignment Report) string is how the SAM/BAM format represents spliced alignments. Understanding the CIGAR string will help you understand how your query sequence aligns to the reference genome. For example, the position stored is the leftmost coordinate of the alignment. To get to the rightmost coordinate, you have to parse the CIGAR string. Let’s consider a few concrete examples. First example: The shown alignment will...


  9. Practical Machine Learning For Everyday Life

    In this very practical R tutorial, we will see if we can use our machine learning skills to study something we enjoy in everyday life: wine. We will use wine quality data from the UCI Machine Learning Repository. These two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. To start, read the data into R. wine1.url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv" wine1 <- read.csv(wine1.url, header=TRUE, sep=';') wine2.url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv" wine2 <- read.csv(wine2.url,...


  10. Lsf For Dummies

    LSF for dummies. (Actually, if you’ve gotten to the point where you need to use LSF, you’re far from a dummy ;P) LSF (Load Sharing Facility) is a system to manage programs that generally cannot be run interactively on a machine because they require too much CPU-time, memory, or other system resources. For that reason, those large programs have to be run in batch as jobs. LSF takes care of that batch management. Based on the...


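
The CIGAR teaser (item 8) notes that the stored position is the leftmost coordinate and that reaching the rightmost one means parsing the CIGAR string. Here is a minimal Python sketch of that idea, not the code from the post itself. The names `alignment_end` and `REF_CONSUMING` are made up for illustration. It assumes a 1-based leftmost position, as in the SAM POS field, and counts only the operations that consume reference bases (M, D, N, = and X).

```python
import re

# CIGAR operations that advance along the reference sequence.
# Insertions (I), clips (S, H) and padding (P) do not.
REF_CONSUMING = set("MDN=X")


def alignment_end(pos, cigar):
    """Return the 1-based rightmost reference coordinate of an alignment.

    pos   -- 1-based leftmost reference coordinate of the alignment
    cigar -- CIGAR string, e.g. "5M2I3M"
    """
    # Split the string into (length, operation) pairs.
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)

    # Reject empty or malformed strings (including "*", used when no CIGAR is available).
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError("cannot parse CIGAR string: %r" % cigar)

    # Total number of reference bases covered by the alignment.
    ref_len = sum(int(n) for n, op in ops if op in REF_CONSUMING)
    return pos + ref_len - 1


if __name__ == "__main__":
    print(alignment_end(100, "10M"))      # simple ungapped match
    print(alignment_end(1, "3M1000N4M"))  # spliced read skipping 1000 reference bases
```

Soft clips and insertions are present in the read but not in the reference, so they are ignored when computing the end coordinate. Skipped regions (N) are counted, which is why a spliced read can span far more reference bases than its own length.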