Developers versus consumers of bioinformatics analysis tools

Life in the middle

As a bioinformatics applications scientist, I work in a middle ground between those who develop code for analyzing next-gen sequencing data and those who consume that analysis.   The developers are often people trained in computer science, mathematics, and statistics.   The consumers are often people trained in biology and medicine.    There is some overlap, of course, but if you’ll allow a broad generalization, I think the two groups (developers and consumers) are separated by large cultural differences in science.

Ensemble focus vs. single-gene focus

I think one of the biggest differences from my experience is the approach to conceptualizing the results of a next-gen seq experiment (e.g., RNA-seq).    People on the methods side tend to think in terms of ensembles and distributions.   They are interested in how the variation observed across all 30,000 genes can be modeled and used to estimate differential expression.   They are interested in concepts like shrinkage estimators, bayesian priors, and hypothesis weighting.  The table of differentially expressed features is thought to have meaning mainly in the statistical analysis of pathway enrichment (another ensemble).

Conversely, biologists have a drastically different view.   Often they care about a single gene or a handful of genes and how those genes vary across conditions of interest in the system that they are studying.  This is a rational response to the complexity of biological systems; no one can keep the workings of hundreds of genes in mind.  In order to make progress, you must focus.  However, this narrow focus leads investigators to sometimes cherry pick results pertaining to their ‘pet’ genes.  This can invest those results with more meaning than is warranted from a single experiment.

The gene-focus of biologists also leads to clashes with the ensemble-focus of bioinformatics software.  For example, major DE analysis packages like DESeq2 do not have convenience functions to make volcano plots, even though I’ve found that those kinds of plots are the most useful and easiest to understand for biologists.   Sleuth does have a volcano plotting function, but doesn’t allow for labeling genes.  In my experience, however, biologists want a high-res figure with common name gene labels (not ensemble transcript IDs) that they can circle and consider for wet lab validation.

New software to address the divide?

I am hopeful about the recent release of the new “bcbioRNASeq” package from Harvard Chan school bioinformatics (developers of ‘bcbio’ RNA-seq pipeline software).   It appears to be a step towards making it easier for people like me to walk on both sides of this cultural divide.    The package takes all of the outputs of the ‘bcbio’ pipeline and transforms them into an accessible S4 object that can be operated on quickly and simply within R.

Most importantly, the ‘bcbioRNASeq’ module allows for improved graphics and plotting that make communication with biologists easier.  For example, the new MA and volcano plots appear to be based on ‘ggplot2’ graphics and are quite pretty (and have text label options(!), not shown here):

I still need to become familiar with the ‘bcbioRNASeq’ package, but it looks quite promising.   ‘pcaExplorer‘ is another R/Bioconductor package that I feel does a great job of making accessible RNA-seq reports quickly and easy.  I suspect we will see continued improvements to make plots prettier and more informative, with an increasing emphasis on interactive, online plots and notebooks rather than static images.

Confounding in *-seq experiments

I see a lot of experimental design that is confounded.  Please read this important essay to understand why this must be avoided to have quality results from expensive *-seq experiments.

Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiments, learn about randomization.

 

Artificial neural network classification of copy number variation, part 1

In this series of posts, I want to describe some work I’ve been doing attempting to apply an artificial neural network model to the problem of classifying copy number variation in a clinically-important gene, CYP2D6.   I will be presenting a poster-paper on this topic at the 2018 ISMB/ISCB meeting in Chicago.

Here is the abstract from the poster:

Pharmacogenomics is a rapidly developing field that aims to deliver on the promise of personalized medicine by guiding pharmacological intervention using an understanding of a patient’s individual genotype for drug metabolism.  By avoiding ineffective or dangerous treatments, patient outcomes could be dramatically improved and hospital costs reduced. We have developed a sequencing-based pharmacogenomics screening panel that uses targeted capture to perform deep sequencing of 200+ critical drug metabolism genes.  Assessing copy-number variation is a critical part of correctly interpreting genotypes in key drug metabolism genes, such as CYP2D6.  Historically, this has been a time-consuming step in our clinical pipeline involving large amounts of expert analyst time and data visualization.  This study presents a novel application of an artificial neural network (ANN) machine learning algorithm to learn the complex patterns in CNV data.  The result is a trained network that can quickly and accurately classify copy number events according to known training categories in the CYP2D6 gene.  We show that a simple, one hidden-layer network is sufficient to achieve the extremely high accuracy and low false-positive rate required in a high-throughput clinical setting.

Motivating factors

Motivating this work is the fact that interpreting copy number variation data from targeted capture sequencing is difficult owing to several factors.  First, the data are noisy owing to biases in capture efficiency and GC content.  Second, the copy number variation events in a gene like CYP2D6 are complex and subtle, but have dramatic impacts on the functional status of the gene.  Third, most copy number variation detection methods are optimized for whole genome sequencing with smooth and even coverage across each chromosome.

The result is that, although most of the variant calling and interpretation is automated, a person still has to sit down with the copy number variation copy-ratio plots and make a manual determination of genotype for the CYP2D6 gene (and other genes if they contain clinically-actionable CNVs).   Below is one such copy-ratio plot that illustrates the problems we face:

copy ratio plot
A copy-ratio plot for 9 clinical samples from our targeted sequencing pipeline at the CYP2D6 gene and CYP2D7 pseudogene.  Note the noise in the data.

You can see from the plot that one sample (red) has what appears to be a deletion (copy ratio ~ 0.5).   We do look at these plots individually, but the problem with noise and complexity remains.  This means that the analyst must be, in effect, a domain expert on each gene with a clinically-actionable copy number variation.   This limits pipeline throughput and is impractical if the goal is to process hundreds or thousands of patient samples per month.

A role for machine learning?

It occurred to me that the copy ratio data fall into distinct patterns that a human (like myself) learns by eye and with experience.  These patterns could be classifiable by machine-learning methods.   I considered applying a convolution neural network (CNN) to the plot above.

However, I only had 175 samples in my test/train set drawn from real patients and CEPH/Coriell depositories.   Therefore, I thought it would be best to start with simple models and go from there.  To establish the ground truth CYP2D6 genotype, the samples had all be processed by our current pipeline methods, with many (but not all) confirmed by Taqman assay.

It began to dawn on me that I first needed a better representation of the training data.   I liked the output of the CNV-kit method (below) better than the method from our pipeline (above):

Visualization of CNV-kit output at CYP2D6. The gray and orange lines are not important to this discussion.

The important part of the plot above are the gray dots, representing copy ratio values across 19 different bins or segments along the CYP2D6/2D7 gene and pseudogene.   CNV-kit also had the advantage of being well-documented, fast to run, and available as open-source software on github.

Tidying the training set

The data plotted above looked like this after I did some wrangling and tidying in python:

Copy ratio data from CNV-kit output at the CYP2D6 and 2D7 gene/pseudogene.  There are five columns to the right (four not shown) that contain “one-hot” encoded ground-truth for model training.

With the data in “tidy” form (each column a variable, each row an observation, thank you Hadley Wickham), I was ready to train some machine learning models to see if I could classify common copy number variation events in CYP2D6 and bypass the time-consuming visualization step.

I had no idea if this would work, given that there are only 175 samples (total) to train on and some of the more rare copy number variation events only have a handful of examples.  In part 2, I will talk more about how I tried a LASSO regression model, which, while performing well, failed to yield the high accuracy needed for an automated clinical pipeline.  I then tried a simple one-layer neural network approach.  I’ll talk about the surprisingly good performance of neural network approach and also future directions for this project.

 

 

Decode SAM files with these handy references

I recently had to inspect some genomic alignments as part of a project.  Usually, I am just working with BAM files and if inspection is needed, I just visualize the pileups to see what is going on.

In this case, I just wanted a quick answer to how the reads were aligning to the reference, and I didn’t want to go through the process of subsetting and copying the BAM files to my local machine.

The SAM file is the uncompressed record of the read alignments produced by an aligner method (STAR, TopHat, BWA, etc….).   This file can get very large, and so is usually compressed into BAM (faster for machine parsing, but not human readable) and the SAM file is discarded.

In my case, I still had the SAM files around to inspect.  If you find yourself needing to read a SAM file, here are three helpful reference tools to make the process less painful:

1)  This page has an enormous amount of detail about SAM files including this helpful chart that enumerates all of the fields that you can expect to find specified within each alignment:

SAM file structure explained in this handy chart.

 

2) This post from the blog “zenfractal.com” contains a great exposition on CIGAR strings and how to decode them:

3)  And finally, if you’re trying to decode the SAM bitwise flags, you can calculate them using this tool from the Broad Institute:

Decode SAM flags with this handy online tool from the Broad Institute.