Conference report: GLBIO2019

I just returned from another great experience at Great Lakes Bio 2019 (#GLBIO2019), a regional meeting of the International Society for Computational Biology (ISCB). Below I'll briefly summarize a few of the talks I found most interesting (there were several parallel tracks, so I could not attend everything).

Docker workshop taught by Sara Stevens

On the Sunday of the conference, I attended a 3-hour workshop introducing Docker, held in the beautiful and very modern Wisconsin Institutes for Discovery building. The course was taught by Sara Stevens, an expert in data science and bioinformatics with the Data Science Hub at the University of Wisconsin-Madison.

We worked through an initial “hello world” application of Docker on our laptops, writing a Dockerfile that became an image and finally a container instance of that image:

mchiment@MNE762:~/Desktop/docker-playground/my-greeting$ cat Dockerfile
# specify the base image
FROM alpine
# specify what to build
RUN /bin/echo "greeting!" > /root/my_message
# give default command
CMD ["/bin/cat", "/root/my_message"]
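To turn a Dockerfile like this into a running container, the standard CLI sequence is docker build -t my-greeting . (run from the directory containing the Dockerfile; the trailing dot is the build context), followed by docker run my-greeting, which starts a container from the new image and prints the message.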

Then we progressed to more complex Dockerfile builds, including one that installed a minimal Python distribution, installed some libraries with pip within the image, and ran a script.
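As a rough sketch (the base image, package list, and script name here are placeholders of my own choosing, not the exact workshop example), such a Dockerfile might look like this:

# start from a slim Python base image (placeholder choice)
FROM python:3-slim
# install libraries into the image with pip
RUN pip install --no-cache-dir numpy pandas
# copy a script into the image and make it the default command
COPY analyze.py /app/analyze.py
CMD ["python", "/app/analyze.py"]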

Overall, I learned a lot and got a good grasp of the Docker basics to build upon for future work.

Integrative analysis for fine mapping of genetic variants, Sunduz Keles

This talk addressed the issue of how to make sense of GWAS data. Given a collection of SNPs, how do you decide which genes to follow up on, which mechanisms to propose, etc.? The talk introduced a tool, atSNP Search, which uses transcription factor position weight matrices (PWMs) to assess the impact of a SNP on TF DNA-binding activity within the local area of the SNP.

From the website:

atSNP identifies and quantifies best DNA sequence matches to the transcription factor PWMs with both the reference and the SNP alleles in a small window around the SNP location (up to +/- 30 base pairs and considering subsequences spanning the SNP position). It evaluates statistical significance of the match scores with each allele and calculates statistical significance of the score difference between the best matches with the reference and SNP alleles.
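To make that concrete, here is a minimal sketch of the underlying idea (scoring the best PWM match for each allele in a window around the SNP), written in Python with a made-up PWM and toy sequences; this is my own illustration, not atSNP's implementation:

import numpy as np

# toy log-odds position weight matrix, one row per motif position
PWM = np.array([
    #  A     C     G     T
    [ 1.2, -0.7, -0.2, -1.0],   # motif position 1
    [-0.5,  1.0, -0.8,  0.3],   # motif position 2
    [-1.0, -0.3,  1.5, -0.9],   # motif position 3
    [ 0.8, -1.2, -0.5,  1.1],   # motif position 4
])
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_match_score(seq, pwm):
    """Scan every subsequence of motif length; return the best PWM score."""
    k = pwm.shape[0]
    scores = [
        sum(pwm[i, BASE[seq[start + i]]] for i in range(k))
        for start in range(len(seq) - k + 1)
    ]
    return max(scores)

# a small window around the SNP position, with each allele substituted in
ref_window = "GGACATTG"   # reference allele at the center
snp_window = "GGACGTTG"   # SNP allele swapped in

ref_score = best_match_score(ref_window, PWM)
snp_score = best_match_score(snp_window, PWM)
print(f"ref {ref_score:.2f}  snp {snp_score:.2f}  diff {snp_score - ref_score:.2f}")

atSNP additionally attaches statistical significance to these scores and score differences, which the toy code above does not attempt.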

The talk also introduced a method, "FM-HighLD", which asks whether functional annotations of SNPs can substitute for massively parallel reporter assays (MPRAs), considered the "gold standard" for assessing SNP/eQTL function. The idea is to use MPRA results and their correlation with functional annotations to calibrate the model, and then apply it to eQTLs or GWAS SNPs that have no MPRA results but do have functional annotations in public databases.

refine.bio

There is over $4 billion worth of publicly funded RNA-seq and microarray data in public repositories. Studies have shown that analysts can spend up to 30% of a project's time just searching for, accessing, downloading, and preprocessing these data.

Refine.bio is an attempt to "harmonize" thousands of gene expression datasets by downloading and preprocessing them with a common pipeline and a common reference. This is only possible owing to the innovation of pseudo-alignment in methods like kallisto and salmon.

In the background, refine.bio runs on Amazon Web Services, which gives the project unlimited compute and storage to scale according to their needs. In addition to standardized gene expression processing, sample metadata are also harmonized, where keywords are mapped to standard ontologies for ease of comparison.
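The metadata harmonization step can be pictured as keyword-to-ontology mapping. This toy sketch is my own illustration, not refine.bio's actual code, and the mappings shown are just examples:

# toy mapping from free-text sample keywords to ontology terms (illustrative IDs)
KEYWORD_TO_ONTOLOGY = {
    "female": ("sex", "PATO:0000383"),
    "f": ("sex", "PATO:0000383"),
    "male": ("sex", "PATO:0000384"),
    "liver": ("tissue", "UBERON:0002107"),
    "hepatic tissue": ("tissue", "UBERON:0002107"),
}

def harmonize(raw_metadata):
    """Map messy per-sample keywords onto standard ontology annotations."""
    harmonized = {}
    for value in raw_metadata.values():
        key = value.strip().lower()
        if key in KEYWORD_TO_ONTOLOGY:
            field, term_id = KEYWORD_TO_ONTOLOGY[key]
            harmonized[field] = term_id
    return harmonized

print(harmonize({"gender": "F", "source": "Hepatic tissue"}))
# {'sex': 'PATO:0000383', 'tissue': 'UBERON:0002107'}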

Monitoring crude oil spills with 16S and machine-learning, Stephen Techtmann

In this work, Dr. Techtmann's group looked at how freshwater microbiomes drawn from Lake Superior respond to the introduction of different types of oil (a complex chemical mixture that acts as a carbon food source). The team drew lake water samples, incubated them with different oils (heavy crude, refined crude, etc.), and then assessed taxonomic abundance using 16S rRNA amplicon sequencing.

The taxa abundances were used to train a random forest model to predict oil contamination status. The random forest achieved extremely high accuracy (AUC > 0.9), and they found that two taxa predominantly distinguish the oil-exposed samples from plain lake water.
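As a sketch of the general approach (my own illustration with simulated data, not the group's actual code), training such a classifier on a taxa-abundance table and pulling out the most discriminative taxa might look like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# rows = water samples, columns = relative abundances of 50 taxa (simulated)
X = rng.random((60, 50))
y = np.repeat([0, 1], 30)          # 0 = clean lake water, 1 = oil-exposed
X[y == 1, 0] += 1.0                # make two taxa respond to oil, as in the talk
X[y == 1, 1] += 0.8

model = RandomForestClassifier(n_estimators=500, random_state=0)
print("AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

# feature importances point to the taxa that distinguish oil from lake water
model.fit(X, y)
top_taxa = np.argsort(model.feature_importances_)[::-1][:2]
print("most discriminative taxa (column indices):", top_taxa)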

Artificial neural network classification of copy number variation, part 1

In this series of posts, I want to describe some work I've been doing to apply an artificial neural network model to the problem of classifying copy number variation in a clinically important gene, CYP2D6. I will be presenting a poster on this topic at the 2018 ISMB/ISCB meeting in Chicago.

Here is the abstract from the poster:

Pharmacogenomics is a rapidly developing field that aims to deliver on the promise of personalized medicine by guiding pharmacological intervention using an understanding of a patient’s individual genotype for drug metabolism.  By avoiding ineffective or dangerous treatments, patient outcomes could be dramatically improved and hospital costs reduced. We have developed a sequencing-based pharmacogenomics screening panel that uses targeted capture to perform deep sequencing of 200+ critical drug metabolism genes.  Assessing copy-number variation is a critical part of correctly interpreting genotypes in key drug metabolism genes, such as CYP2D6.  Historically, this has been a time-consuming step in our clinical pipeline involving large amounts of expert analyst time and data visualization.  This study presents a novel application of an artificial neural network (ANN) machine learning algorithm to learn the complex patterns in CNV data.  The result is a trained network that can quickly and accurately classify copy number events according to known training categories in the CYP2D6 gene.  We show that a simple, one hidden-layer network is sufficient to achieve the extremely high accuracy and low false-positive rate required in a high-throughput clinical setting.

Motivating factors

Motivating this work is the fact that interpreting copy number variation data from targeted capture sequencing is difficult owing to several factors.  First, the data are noisy owing to biases in capture efficiency and GC content.  Second, the copy number variation events in a gene like CYP2D6 are complex and subtle, but have dramatic impacts on the functional status of the gene.  Third, most copy number variation detection methods are optimized for whole genome sequencing with smooth and even coverage across each chromosome.

The result is that, although most of the variant calling and interpretation is automated, an analyst still has to sit down with the copy-ratio plots and make a manual determination of genotype for the CYP2D6 gene (and any other genes containing clinically actionable CNVs). Below is one such copy-ratio plot that illustrates the problems we face:

A copy-ratio plot for 9 clinical samples from our targeted sequencing pipeline at the CYP2D6 gene and CYP2D7 pseudogene.  Note the noise in the data.

You can see from the plot that one sample (red) has what appears to be a deletion (copy ratio ~ 0.5).   We do look at these plots individually, but the problem with noise and complexity remains.  This means that the analyst must be, in effect, a domain expert on each gene with a clinically-actionable copy number variation.   This limits pipeline throughput and is impractical if the goal is to process hundreds or thousands of patient samples per month.

A role for machine learning?

It occurred to me that the copy-ratio data fall into distinct patterns that a human (like myself) learns to recognize by eye and with experience. These patterns should be classifiable by machine-learning methods. I considered applying a convolutional neural network (CNN) to the plot above.

However, I only had 175 samples in my train/test set, drawn from real patients and CEPH/Coriell repositories. Therefore, I thought it best to start with simple models and go from there. To establish the ground-truth CYP2D6 genotype, all of the samples had been processed by our current pipeline methods, with many (but not all) confirmed by TaqMan assay.

It began to dawn on me that I first needed a better representation of the training data. I liked the output of the CNVkit method (below) better than the output of our pipeline's method (above):

Visualization of CNVkit output at CYP2D6. The gray and orange lines are not important to this discussion.

The important part of the plot above is the gray dots, which represent copy-ratio values across 19 different bins or segments along the CYP2D6/CYP2D7 gene and pseudogene. CNVkit also had the advantage of being well documented, fast to run, and available as open-source software on GitHub.

Tidying the training set

The data plotted above looked like this after I did some wrangling and tidying in python:

Copy-ratio data from CNVkit output at the CYP2D6 and CYP2D7 gene/pseudogene. There are five columns to the right (four not shown) that contain "one-hot" encoded ground truth for model training.

With the data in “tidy” form (each column a variable, each row an observation, thank you Hadley Wickham), I was ready to train some machine learning models to see if I could classify common copy number variation events in CYP2D6 and bypass the time-consuming visualization step.
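As an illustration of that tidying step (my own sketch; the real CNVkit column names and my label categories differ), going from per-bin copy ratios plus a genotype call to a tidy, one-hot-encoded table might look like this:

import pandas as pd

# one row per sample: per-bin copy-ratio features plus a ground-truth label
df = pd.DataFrame({
    "sample": ["s1", "s2", "s3"],
    "bin_01": [1.02, 0.55, 1.48],   # ...bins 02 through 18 omitted for brevity
    "bin_19": [0.98, 0.51, 1.52],
    "genotype": ["normal", "deletion", "duplication"],
})

# one-hot encode the ground-truth genotype for model training
one_hot = pd.get_dummies(df["genotype"], prefix="label")
tidy = pd.concat([df.drop(columns="genotype"), one_hot], axis=1)
print(tidy)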

I had no idea whether this would work, given that there are only 175 samples in total to train on, and some of the rarer copy number variation events have only a handful of examples. In part 2, I will talk more about how I tried a LASSO regression model, which, while performing well, failed to yield the high accuracy needed for an automated clinical pipeline. I then tried a simple one-hidden-layer neural network approach. I'll talk about the surprisingly good performance of the neural network approach and about future directions for this project.
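For a flavor of the two model families mentioned (a sketch on simulated data, not the actual clinical pipeline; the class labels and hidden-layer size are placeholders), scikit-learn makes the comparison compact:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(175, 19))      # 175 samples x 19 copy-ratio bins
y = rng.integers(0, 3, size=175)    # 3 toy CNV classes (placeholder labels)

# L1-penalized ("LASSO"-style) logistic regression vs. a one-hidden-layer net
lasso = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)

# features here are random noise, so scores will hover near chance (~0.33);
# with real copy-ratio data this comparison is the interesting part
for name, model in [("LASSO", lasso), ("1-hidden-layer ANN", mlp)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.2f}")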

Get hands-on with t-SNE plots

With the growing popularity of single-cell RNA-Seq analysis, the t-SNE projection of multi-dimensional data is appearing more often in publications and online.  If you’ve ever wanted to develop a better intuitive feel for what exactly t-SNE does and where it can go wrong, this interactive tutorial (by Martin Wattenberg and Fernanda Viegas) is extremely compelling and useful.

A screen capture of the interactive t-SNE interface.

In addition to providing a wonderful, interactive plotting function, the authors provide an informative tutorial that explains the pitfalls and challenges of optimization and hyperparameter tuning for t-SNE projections, and how to get the most from the plots. Here is an example:

An example of how hyperparameter tuning affects the final plot.

In the example above, tuning the "perplexity" of the t-SNE projection yields a correct reconstruction of the data when values are between 30 and 50, but the same method fails when the parameter falls outside that range (i.e., is too small or too large).
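If you want to poke at this locally as well, scikit-learn's TSNE exposes the same perplexity knob; here is a minimal sketch (the cluster layout and perplexity values are arbitrary choices of mine):

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# three well-separated Gaussian clusters in 10 dimensions
centers = rng.normal(scale=10, size=(3, 10))
X = np.vstack([c + rng.normal(size=(50, 10)) for c in centers])

# sweep perplexity: very small values fragment clusters, mid-range recovers them
for perplexity in (2, 30, 100):
    emb = TSNE(perplexity=perplexity, random_state=0).fit_transform(X)
    print(f"perplexity={perplexity}: embedding shape {emb.shape}")

Plotting each embedding (e.g., with matplotlib) makes the effect obvious in the same way the interactive tutorial does.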

Go check out this distill.pub site.  It’s worth your time.

Are deep neural nets “Software 2.0”?

Image from: https://cdn.edureka.co/blog/wp-content/uploads/2017/05/Deep-Neural-Network-What-is-Deep-Learning-Edureka.png

Recent blog posts by Andrej Karpathy at Medium.com and Pete Warden at PeteWarden.com have caused a paradigm shift in the way I think about neural nets. Instead of thinking of them as powerful machine-learning tools, the authors suggest that we should think of neural nets, and in particular deep convolutional nets, as 'self-writing programs.' Hence the term, "Software 2.0."

It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data than to explicitly write the program. A large portion of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze and visualize data that feeds neural networks.   — Andrej Karpathy, Medium.com

I found this to be a dramatic reversal in my thinking about these techniques, but it opens up a deeper understanding and is much more intuitive. The fact is that combinations of artificial neurons can model any logical operation. You can therefore conceptualize training a neural net as searching program space for an optimal program that behaves in the way you specify: you provide the inputs and desired outputs, and training searches for the program that maps one to the other.
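That claim is easy to demonstrate in a few lines of Python: a single artificial neuron with hand-picked weights computes AND or OR, and stacking a second layer yields XOR, which no single neuron can compute. The weights below are chosen by hand for illustration:

import numpy as np

def neuron(x, w, b):
    """A single artificial neuron: weighted sum plus step activation."""
    return int(np.dot(w, x) + b > 0)

def AND(x):  return neuron(x, w=[1, 1], b=-1.5)
def OR(x):   return neuron(x, w=[1, 1], b=-0.5)
def XOR(x):  # needs two layers: XOR = OR(x) AND NOT AND(x)
    return neuron([OR(x), AND(x)], w=[1, -1], b=-0.5)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", AND(x), "OR:", OR(x), "XOR:", XOR(x))

Training replaces the hand-picked weights with ones found by optimization, which is exactly the "searching program space" framing above.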

This stands in contrast to the “Software 1.0” paradigm where the programmer uses her skill and experience to conceptualize the right combination of specific instructions to produce the desired behavior.   While it seems certain that Software 1.0 and 2.0 will co-exist for a long time, this new way of understanding deep learning is crucial and exciting, in my opinion.

Five (easy) ways to start learning about convolutional neural nets

A schematic of a convolutional neural network (CNN).

Here are five different ways to gain an introduction to the topic of CNNs.  Each approach is geared toward a different style of learning:

1. Visualize them in real time with your own inputs (this is amazing!)

2. Watch a lecture by the "godfather" of neural nets, Geoff Hinton.

3. Take a top-ranked online course on deep learning.

4. Learn the math behind them.

5. Code one yourself in Python (a minimal sketch of the core operation is shown below).
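And if you take option 5, the core operation of a CNN is small enough to write directly in numpy. Here is a minimal sketch of a single convolution filter (just the forward pass, no training; the image and kernel are toy examples of mine):

import numpy as np

def conv2d(image, kernel):
    """Naive 2D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # dot product of the kernel with each image patch it slides over
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# a vertical-edge detector applied to a toy image with a bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 1.0]] * 3)  # responds to left-to-right increases

print(conv2d(image, edge_kernel))  # peaks at the column where the edge sits

A real CNN learns the kernel values by backpropagation and stacks many such filters with nonlinearities and pooling, but the sliding dot product above is the heart of it.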