Artificial neural network classification of copy number variation, part 2

Recap:

Welcome to the second part of this post series on building artificial neural network models for copy number classification.  In the first part, I described the challenge of interpreting copy-ratio plots to find clinically relevant CNV events.  The data from targeted capture deep sequencing are noisy and biased, and finding clinically relevant genotypes in genes that have CNVs requires the analyst to visualize the CNV event and assign a classification on the basis of experience and expert knowledge.

The LASSO model

Once my training data were in place (see part 1), I used a multiple linear regression LASSO model as a machine-learning benchmark.  I did this to determine whether a more powerful neural network model would be warranted.  The LASSO model uses an "L1" prior to perform feature selection, setting some coefficients to zero as warranted by the data.  There is ample precedent for applying this type of model in bioinformatics settings where the goal is to maximize predictive power without overfitting.

I fit the LASSO to the data, with 33% held out for validation.  The best fit was obtained with the alpha parameter set to 0.001.  k-fold cross validation (where k=10 and alpha=0.001) yielded an accuracy of 76%.   These results are surprisingly good, given the complexity of the CNV signals in the noisy data.  Unfortunately, 76% accuracy is simply not good enough for an automated method that will be used to predict genotypes in clinical data.
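For reference, this kind of benchmark can be set up along the following lines in scikit-learn. This is only a minimal sketch: X and y stand in for the 19-probe copy-ratio matrix and integer-coded class labels, and scoring accuracy by rounding the regression output to the nearest class is an assumption rather than the exact procedure used here.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score, train_test_split

# X: (n_samples, 19) copy-ratio features; y: integer-coded CNV classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

lasso = Lasso(alpha=0.001)   # the L1 prior zeroes out some coefficients
lasso.fit(X_train, y_train)

# Treat the rounded regression output as a class call and score accuracy
# with 10-fold cross-validation
rounded_accuracy = make_scorer(
    lambda y_true, y_pred: accuracy_score(y_true, np.rint(y_pred).astype(int)))
scores = cross_val_score(lasso, X, y, cv=10, scoring=rounded_accuracy)
print("10-fold CV accuracy: %.1f%% (+/- %.1f%%)"
      % (100 * scores.mean(), 100 * scores.std()))
```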

The ANN model

Next, I decided to construct an artificial neural network model.  My goal was to keep the model as simple as possible while reaching the very high classification accuracy needed for clinical work.  To that end, I constructed a single-hidden-layer model with 19 input nodes, corresponding to the 19 copy-ratio probes in the CNV data.  The output layer contained five nodes, one per class; the classes cover the defined CNV events plus other recurrent events (for example, a very distinct sequencing artifact that kept appearing in the data):

[Figure: diagram of the network architecture]

In between the input and output layers I constructed a 10-node hidden layer.  A single hidden layer is the simplest form of the ANN model, and I tried to keep the number of hidden-layer nodes to a minimum as well.  Specific details about the model, the hyper-parameter tuning, and the code will be available in the near future, when I put a pre-print of this work on bioRxiv.
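For readers who want a concrete picture of the architecture, here is a minimal Keras sketch. The activation functions, optimizer, and loss shown are assumptions; the actual hyper-parameters will be in the preprint.

```python
from tensorflow import keras
from tensorflow.keras import layers

# 19 copy-ratio probes in, one small hidden layer, 5 output classes
model = keras.Sequential([
    layers.Dense(10, activation="relu", input_shape=(19,)),  # 10-node hidden layer
    layers.Dense(5, activation="softmax"),                   # one node per class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```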

Model training and cross-validation

I trained the model on the 175-sample dataset and on a 350-sample "synthetic" dataset created by adding Gaussian noise to the real data.  The results across 250 training epochs are shown below.
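The augmentation step amounts to a few lines of NumPy. This is a sketch only: the noise scale (sigma) and the variable names are placeholders, not the values actually used.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def augment_with_noise(X, y, sigma=0.05):
    """Double the dataset by adding Gaussian noise to each copy-ratio profile."""
    X_noisy = X + rng.normal(loc=0.0, scale=sigma, size=X.shape)
    return np.vstack([X, X_noisy]), np.concatenate([y, y])

# X_real: (175, 19) copy-ratio matrix, y_real: class labels
# X_aug, y_aug = augment_with_noise(X_real, y_real)   # -> 350 samples
# model.fit(X_aug, keras.utils.to_categorical(y_aug), epochs=250)
```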

When the ANN model was tested with 10-fold cross-validation, the accuracy reached 96.5% (+/- 5.4%).  This is obviously a big improvement over the LASSO model, and it reaches a level of accuracy that is good enough for clinical pipelines (with the caveat that low-confidence predictions will still be checked "by hand").

Below, I'm showing a sample of the model output (left) and ground truth (right) from the test data.  The numbers (and colors) of the boxes correspond to the probability the model assigned to that classification.  You can see that most CNV events are called with high probability, but several (yellow boxes) are called correctly with lower probability.  One event (red box) is called incorrectly with high probability.

Conclusions and caveats

Going into this project, I had no idea whether the ANN model would be able to make predictions on the basis of so few examples in the training set.  The classic examples you see of ANN/CNN models rely on handwriting training sets with 10,000 or more images, so I was surprised when the model did very well with extremely limited training data.  Since this method was developed for a clinical pipeline, it can be improved as the pipeline generates new training data with each new patient sample.  We would need to run many thousands of samples through our "legacy" pipeline to see enough examples of the rare star-allele events in CYP2D6 to classify them, which is why I limited my CNV calling to three star alleles.

The low-confidence, true-positive predictions concern me less than the high-confidence false negative.  Missing a real CNV that affects CYP2D6 function, and is therefore clinically relevant, is very dangerous: it can lead to incorrect prescribing and adverse drug reactions for the patient.  I really want to understand why the method makes predictions like this and how to fix it.  Unfortunately, I believe solving this will require a lot more training data, and that is something I lack.

My goals for this project now are 1) to publish a preprint on bioRxiv describing this work and 2) to obtain additional training/test datasets.  Because our pharmacogenomics test is not generating the kind of volume we expected, I may have to look for another gene with clinically relevant CNV events to test this method further.  For example, we have an NGS-based test of hearing and deafness genes with thousands of validated patient samples.  One gene, STRC, has relevant CNVs that are complex and require analyst visualization to detect.  This may be a good system for follow-up refinement of this type of model.

Filtering variants for cancer mutational signature analysis

Recently, I've been working to help prepare a manuscript on vestibular schwannomas (VS), a type of benign tumor of the myelin-forming cells along the nerves of the ear.  I've been thinking a lot about strategies for filtering exome variant calls to feed into mutational signature analysis.

Mutational signatures are important because they describe the types of mutational processes operating on the genome of the tumor cells.  Many of these processes are known (see the COSMIC database); however, some are entirely novel.  The variants used for calculating such signatures are somatic in nature and have to be carefully curated from the raw variant calls that you get from a pipeline like GATK.
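For context, signature analysis typically starts by collapsing the curated somatic SNVs into counts over the 96 trinucleotide-context substitution classes used by COSMIC. Here is a minimal sketch of that tally; the input tuples are made up purely for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sbs96_class(ref, alt, context):
    """Return the SBS96 class (e.g. 'A[C>T]G') for an SNV.

    context is the 3-bp reference sequence centered on the variant.
    """
    assert len(context) == 3 and context[1] == ref
    if ref in ("A", "G"):  # normalize to the pyrimidine strand
        context = "".join(COMPLEMENT[b] for b in reversed(context))
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return "%s[%s>%s]%s" % (context[0], ref, alt, context[2])

# Tally the 96-channel counts for one tumor from its filtered somatic SNVs
snvs = [("C", "T", "ACG"), ("G", "A", "TGC"), ("T", "G", "ATT")]
counts = Counter(sbs96_class(ref, alt, ctx) for ref, alt, ctx in snvs)
print(counts)
```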

Looking at the existing literature, I find that there is no common or "best practices" methodology for filtering variants in whole-exome data.  Some groups are very stringent, others less so.  The first step in most cases is simply to subtract the normal variant calls from the tumor calls.  However, there are further filtering steps that should be undertaken.

If I had to describe the overall commonalities in how the literature approaches somatic variant filtering, they would include the following (a rough code sketch of these filters follows the list):

1) removing variants that are present in dbSNP, 1000 Genomes, or other non-cancer exome data
2) taking only variants in coding regions (exons) or splicing sites
3) variants must appear in more than X reads in the tumor and fewer than Y reads in the normal (generally ~5 and ~2, respectively)
4) subtraction of “normals” from “tumor” (either pooled normals, or paired)
5) variant position must be covered by a minimum depth (usually > 10X)
6) throwing away reads from low mapping quality (MQ) regions
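To make those thresholds concrete, here is a rough Python sketch of filters 1, 2, 3, 5, and 6 applied to the calls left after the normal subtraction in filter 4. The record layout, field names, and default cutoffs are hypothetical and would need to be adapted to whatever your pipeline (GATK or otherwise) actually emits.

```python
# Hypothetical per-variant records remaining after tumor/normal subtraction
variants = [
    {"chrom": "22", "pos": 29999545, "coding_or_splice": True,
     "in_dbsnp_or_1kg": False, "tumor_alt_reads": 12, "normal_alt_reads": 0,
     "depth": 84, "mapping_quality": 60},
]

def passes_filters(v,
                   min_tumor_alt=5,    # filter 3: tumor read support
                   max_normal_alt=2,   # filter 3: normal read support
                   min_depth=10,       # filter 5: minimum coverage
                   min_mq=40):         # filter 6: mapping quality cutoff
    return (not v["in_dbsnp_or_1kg"]            # filter 1
            and v["coding_or_splice"]           # filter 2
            and v["tumor_alt_reads"] > min_tumor_alt
            and v["normal_alt_reads"] < max_normal_alt
            and v["depth"] >= min_depth
            and v["mapping_quality"] >= min_mq)

filtered = [v for v in variants if passes_filters(v)]
print(len(filtered), "variants pass filtering")
```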

Some papers only consider non-synonymous variants, but for mutational signatures it makes sense to me to take all validated variants (especially in exome data, because you start with fewer raw variant calls than in whole-genome data).

As for the actual number of variants "fed" into the mutational signature analysis, most papers (surprisingly) do not report this directly.  If you dig around in the supplementary information, you can sometimes find it indirectly.

It looks like the number of variants is generally somewhere around 10,000 for papers dealing with specific tumor types (not pan-cancer analyses of public databases).  Several papers end up with roughly 1,000 variants per tumor (ranging from 1,000 up to 7,000), so with 10 tumors sequenced, that works out to about 10,000 filtered, high-confidence SNVs.

If you're working on exome mutational signature analysis and you have your own filtering criteria, I'd love for you to share them in the comments.