  1. Custom Sequencing Library Bioinformatics

    Bioinformatics for custom sequencing library constructions. Sequencing has become so streamlined that we often just use standard library prep kits, made for particular sequencers, followed by proprietary bioinformatics software for demultiplexing and quantification. But if we want to design a custom library structure, perhaps for use in multiplexed droplet-based approaches, we will need to come up with our own bioinformatics pipelines. In this tutorial, I will take you through a recent experience I had analyzing reads...


    Continue reading …
  2. Multiclass Differential Expression Analysis

    Multi-class / Multi-group Differential Expression Analysis. Introduction: In transcriptomics analysis, we are often interested in identifying differentially expressed genes. For example, if we have two conditions or cell types, we may be interested in which genes are significantly upregulated in condition A vs. B or cell type X vs. Y, and vice versa. However, what happens when you have multiple conditions or many cell types? In this tutorial, I will use simulated data to demonstrate...


    Continue reading …
  3. How To Be A Research Parasite

    How to be a “research parasite”: a guide to analyzing public sequencing data from GEO. In this tutorial, I will take you through my workflow for obtaining public sequencing data available on NCBI GEO. Let’s say, for example, that I am interested in analyzing the single cell RNA-seq data found in this paper: “Single-Cell Analysis Reveals a Close Relationship between Differentiating Dopamine and Subthalamic Nucleus Neuronal Lineages”. In the paper, the authors note that “The accession...


    Continue reading …
  4. CIGAR Strings For Dummies

    Smoke and CIGAR (strings). The ‘CIGAR’ (Compact Idiosyncratic Gapped Alignment Report) string is how the SAM/BAM format represents alignments, including gapped and spliced ones. Understanding the CIGAR string will help you understand how your query sequence aligns to the reference genome. For example, the position stored is the leftmost coordinate of the alignment. To get to the rightmost coordinate, you have to parse the CIGAR string. Let’s consider a few concrete examples. First example: The shown alignment will...


    Continue reading …
  5. Practical Machine Learning For Everyday Life

    In this very practical R tutorial, we will see if we can use our machine learning skills to study something we enjoy in everyday life: wine. We will use wine quality data from the UCI Machine Learning Repository. These two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. To start, read the data into R. wine1.url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv" wine1 <- read.csv(wine1.url, header=TRUE, sep=';') wine2.url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv" wine2 <- read.csv(wine2.url,...


    Continue reading …
  6. LSF For Dummies

    LSF for dummies (Actually, if you’ve gotten to the point where you need to use LSF, you’re far from a dummy ;P). LSF (Load Sharing Facility) is a system to manage programs that generally cannot be run interactively on a machine because they require too much CPU time, memory, or other system resources. For that reason, those large programs have to be run in batch as jobs. LSF takes care of that batch management. Based on the...


    Continue reading …
  7. Mapping SNPs And Peaks To Genes In R

    Mapping SNPs and peaks to genes in R. We are often interested in mapping mutations or SNPs to genes, or peaks to genes, or genes to regions of copy number alteration, etc. The general computational problem is quite similar for all these cases: we have two sets of genomic regions that we seek to overlap. This can be accomplished very quickly in R using GenomicRanges. To learn more about GenomicRanges, consult Bioconductor. In this particular...


    Continue reading …
  8. Identifying And Characterizing Heterogeneity In Single Cell RNA-seq Data

    Identifying and Characterizing Heterogeneity in Single Cell RNA-seq Data. In this tutorial, we will become familiar with a few computational techniques we can use to identify and characterize heterogeneity in single cell RNA-seq data. Pre-prepared data for this tutorial can be found as part of the Single Cell Genomics 2016 Workshop I did at Harvard Medical School. Getting started: A single cell dataset from Camp et al. has been pre-prepared for you. The data is...


    Continue reading …
  9. Signaling Network Reconstruction Using Bayesian Networks In R

    In their landmark paper “Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data”, Sachs et al. applied Bayesian networks to multi-parameter flow cytometry data to reconstruct signaling relationships and predict novel interpathway network causalities. Following a tutorial by Marco Scutari, I attempt to reproduce, to the best of my abilities, the statistical analysis of the paper using Marco’s bnlearn R package. library(bnlearn) library(Rgraphviz) First, read in and process the data. Since this is flow cytometry data,...


    Continue reading …
  10. Single Cell Genomics 2016

    Notes from the Single Cell Genomics Conference, September 14-16, 2016, Wellcome Genome Campus, Hinxton, Cambridge, UK (Website). Keynote Lecture: Genomic insights into human cortical development, lissencephaly, and Zika microcephaly; Arnold Kriegstein; University of California, San Francisco, USA. Single cell genomics to look in depth at the developing human brain; clinical applications of identified cell types by genomic signature, e.g. Zika; human brain is 1000x bigger than the mouse brain, mostly in cortex; elephants and porpoises have bigger more...


    Continue reading …
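
As a small illustration of the CIGAR excerpt above (item 4), here is a minimal sketch in base R of how the rightmost coordinate can be recovered from the leftmost position and the CIGAR string. The function name `cigar_end` and the example CIGAR strings are made up for illustration and are not taken from the original post; the sketch assumes 1-based positions as in SAM, and that only the M, D, N, = and X operations consume reference bases.

```r
# Hypothetical helper (not from the original post): rightmost reference
# coordinate of an alignment, given the 1-based leftmost position (SAM POS)
# and the CIGAR string.
cigar_end <- function(pos, cigar) {
  # Split the CIGAR into its run lengths and its operation codes
  lens <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  ops  <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
  stopifnot(length(lens) == length(ops))
  # Only M, D, N, = and X advance along the reference;
  # I, S, H and P do not
  ref_len <- sum(lens[ops %in% c("M", "D", "N", "=", "X")])
  pos + ref_len - 1L
}

# Made-up examples
cigar_end(100L, "50M")              # simple match: 149
cigar_end(100L, "10M200N40M")       # spliced read spanning an intron: 349
cigar_end(100L, "5S20M2I10M3D13M")  # soft clip, insertion, deletion: 145
```

Soft clips and insertions are skipped because they exist only in the read, while deletions and skipped regions (introns) are counted because they span reference bases.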