Artificial neural network classification of copy number variation, part 2
Recap:
Welcome to the second part of this post series on building artificial neural network models for copy number classification. In the first part, I described the problem of interpreting copy-ratio plots to find clinically relevant CNV events. The data from targeted capture deep sequencing are noisy and biased, and finding clinically relevant genotypes in genes that have CNVs requires the analyst to visualize the CNV event and assign a classification on the basis of experience and expert knowledge.
The LASSO model
Once my training data were in place (see part 1), I used a multiple linear regression LASSO model as a machine-learning benchmark. I did this to determine whether a more powerful neural network model would be warranted. The LASSO model uses an "L1" prior to perform feature selection, setting some coefficients to zero as warranted by the data. There is ample precedent for applying this type of model in bioinformatics settings where the goal is to maximize predictive power without overfitting.
I fit the LASSO to the data, with 33% held out for validation. The best fit was obtained with the alpha parameter set to 0.001. k-fold cross validation (where k=10 and alpha=0.001) yielded an accuracy of 76%. These results are surprisingly good, given the complexity of the CNV signals in the noisy data. Unfortunately, 76% accuracy is simply not good enough for an automated method that will be used to predict genotypes in clinical data.
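For concreteness, here is a minimal sketch of this kind of benchmark in scikit-learn. This is not the project's actual code: load_cnv_data() is a hypothetical placeholder, and scoring by rounding the regression output to the nearest integer class label is my own illustrative assumption:

# Minimal LASSO benchmark sketch (not the original code).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_cnv_data()  # hypothetical loader; X: (n_samples, 19), y: integer labels

# Hold out 33% of the data for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33)

lasso = Lasso(alpha=0.001)  # alpha from the tuning described above
lasso.fit(X_train, y_train)

# Score by rounding continuous predictions to the nearest class label.
val_acc = accuracy_score(y_val, np.rint(lasso.predict(X_val)).astype(int))
print("validation accuracy: %.1f%%" % (100 * val_acc))

# 10-fold cross-validation at the same alpha.
scorer = make_scorer(lambda yt, yp: accuracy_score(yt, np.rint(yp).astype(int)))
scores = cross_val_score(lasso, X, y, cv=10, scoring=scorer)
print("CV accuracy: %.1f%% (+/- %.1f%%)" % (100 * scores.mean(), 100 * scores.std()))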
The ANN model
Next, I decided to construct an artificial neural network model. My goal was to keep the model as simple as possible while reaching the very high classification accuracy needed for clinical work. To that end, I constructed a one-hidden-layer model with 19 input nodes, corresponding to the 19 copy-ratio probes in the CNV data. The output layer contained five nodes, one for each of the five classes: the defined CNV event types plus an "other" category (for example, a very distinct sequencing artifact that kept appearing in the data):
In between the input and output layers I constructed a 10-node hidden layer. A one-hidden-layer neural network is the simplest form of the ANN model, and I tried to keep the number of hidden-layer nodes to a minimum as well. Specific details about the model, hyper-parameter tuning, and the code will be available in the near future when I put a pre-print of this work on bioRxiv.
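As a concrete illustration of that shape, here is a minimal sketch in Keras. The architecture (19 inputs, one 10-node hidden layer, 5 softmax outputs) follows the description above, but the activation functions, optimizer, and loss are my own assumptions, not necessarily the choices used in this project:

# One-hidden-layer network: 19 inputs -> 10 hidden nodes -> 5 output classes.
# Activations, optimizer, and loss are illustrative assumptions.
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def build_model():
    model = Sequential([
        Dense(10, activation="relu", input_shape=(19,)),  # hidden layer
        Dense(5, activation="softmax"),                   # one node per class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model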
Model training and cross-validation
I trained the model on the 175-sample dataset and on a 350-sample "synthetic" dataset created by adding Gaussian noise to the real data. The results across 250 training epochs are shown below.
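The augmentation step might look something like this sketch (X and y as in the earlier snippet); the noise level sigma is an assumed value, since it isn't specified here:

# Double the training set by adding Gaussian noise to each real sample.
import numpy as np

rng = np.random.default_rng(seed=0)
sigma = 0.05  # assumed noise level, not a value from the original analysis
X_noisy = X + rng.normal(0.0, sigma, size=X.shape)

X_aug = np.vstack([X, X_noisy])  # 175 real + 175 synthetic = 350 samples
y_aug = np.concatenate([y, y])   # noise does not change the class labels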
When the ANN model was tested with 10-fold cross-validation, the accuracy reached 96.5% (+/- 5.4%). This is obviously a big improvement on the LASSO model, and it reaches a level of accuracy that is good enough for clinical pipelines (with the caveat that low-confidence predictions will still be checked "by hand").
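A rough sketch of that cross-validation loop, reusing build_model() from the sketch above (the stratification and shuffling choices are my assumptions):

# 10-fold cross-validation of the network defined above.
import numpy as np
from sklearn.model_selection import StratifiedKFold

accs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True).split(X_aug, y_aug):
    model = build_model()  # fresh, untrained copy of the network for each fold
    model.fit(X_aug[train_idx], y_aug[train_idx], epochs=250, verbose=0)
    _, acc = model.evaluate(X_aug[test_idx], y_aug[test_idx], verbose=0)
    accs.append(acc)

print("accuracy: %.1f%% (+/- %.1f%%)" % (100 * np.mean(accs), 100 * np.std(accs)))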
Below, I'm showing a sample of the model output (left) and the ground truth (right) from the test data. The numbers (and colors) of the boxes correspond to the model's predicted probability for each class. You can see that most CNV events are called with high probability, but several (yellow boxes) are called correctly with lower probability. One event (red box) is called incorrectly with high probability.
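Those per-class probabilities are also what a pipeline can use to flag calls for manual review. A minimal sketch, assuming a 0.9 confidence cutoff (the real threshold would need to be validated against clinical requirements):

# Flag low-confidence calls for manual review; the 0.9 cutoff is illustrative.
probs = model.predict(X_test)           # X_test: held-out samples; output shape (n, 5)
calls = probs.argmax(axis=1)            # predicted class for each sample
needs_review = probs.max(axis=1) < 0.9  # True where the top probability is low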
Conclusions and caveats
Going into this project, I had no idea whether the ANN model would be able to make predictions on the basis of so few examples in the training set. The classic examples you see for ANN/CNN models rely on handwriting training sets with 10,000 or more images. So I was surprised when the model did very well with extremely limited training data. Since this method was developed for a clinical pipeline, it can be improved as the pipeline generates new training data with each new patient sample. We would need many thousands of samples through our "legacy" pipeline to see enough examples of the rare star-allele events in CYP2D6 to classify them. That is why I limited my CNV calling to three star alleles.
The low-confidence, true-positive predictions concern me less than the high-confidence false negative. Missing a real CNV that affects CYP2D6 function, and is therefore clinically relevant, is very dangerous: it can lead to incorrect prescribing and adverse drug reactions for the patient. I really want to understand why the method makes predictions like this, and how to fix it. Unfortunately, I believe solving this problem will require a lot more training data, and that is something I currently lack.
My goals for this project now are 1) to publish a preprint on bioRxiv describing this work, and 2) to obtain additional training/test datasets. Because our pharmacogenomics test is not generating the kind of volume we expected, I may have to look around for another gene with clinically relevant CNV events to test this method further. For example, we do have an NGS-based test of hearing and deafness genes with thousands of validated patient samples. One gene, STRC, has relevant CNVs that are complex and require analyst visualization to detect. This may be a good system for follow-up refinement of this type of model.
Genomic landscape of metastatic cancer
Integrative genomics sheds new light on metastatic cancer
A new study from the University of Michigan Comprehensive Cancer Center has just been released that takes an in-depth look at the genomics of metastatic cancer, as opposed to primary tumors. This work involved DNA- and RNA-Seq of solid metastatic tumors from 500 adult patients, as well as sequencing of matched normal tissue to distinguish somatic from germline variants.
tl;dr:
A good overview of the study at the level of the scientific layperson can be found in this press release. It summarizes the key findings (many of which are striking and novel):
- A significant increase in mutational burden of metastatic tumors vs. primary tumors.
- A long-tailed distribution of mutational frequencies (i.e., few genes were mutated at a high rate, yet many genes were mutated).
- About 12% of patients harbored germline variants suspected to predispose to cancer and metastasis, and 75% of those variants were in DNA repair pathways.
- Across the cohort, 37% of patient tumors harbored gene fusions that either drove metastasis or suppressed the cells' anti-tumor functions.
- RNA-Seq showed that metastatic tumors are significantly de-differentiated, and fall into two classes: proliferative and EMT-like (epithelial-to-mesenchymal transition).
A brief look at the data
This study provides a high-level view of the mutational burden of metastatic cancer vis-à-vis primary tumors. Figure 1C from the paper compares mutation rates across tumor types between TCGA (The Cancer Genome Atlas) primary tumors and the MET500 metastatic cohort.
Here we can see that in most cases (colored bars), metastatic cancers had statistically significant increases in mutation rate. The figure also shows that tumor types with low mutation rates in the primary setting "sped up" considerably more than those whose primary tumors already had high rates.
Supplemental Figure 1d (below) shows how often key tumor suppressors and oncogenes are altered in metastatic vs. primary tumors. TP53 is altered more frequently in metastatic thyroid, colon, lung, prostate, breast, and bladder cancers. PTEN is mutated more in prostate tumors. GNAS and PIK3CA are mutated more in thymoma, although these findings do not reach significance. KRAS is altered more in colon and esophagus cancers, but again, these findings do not reach significance after multiple-testing correction.
One other figure I’d like to highlight briefly is Figure 3C from the paper, shown below:
I wanted to mention this figure to illustrate the terrifying complexity of cancer. Knowing which oncogenes are mutated, in which positions, and the effects of those mutations on gene expression networks is not enough to understand tumor evolution and metastasis. There are also new genes being created that do totally new things, and these are unique on a per-tumor basis. None of the above structures have ever been observed before, and yet they were all seen in a survey of just 500 cancers. In fact, ~40% of the tumors in the study cohort harbored at least one fusion suspected to be pathogenic.
There is much more to this work, but I will leave it to interested readers to go read the entire study. I think this work is obviously tremendously important and novel, and represents the future of personalized medicine. That is, a patient undergoing treatment for cancer will have their tumor or tumors biopsied and sequenced cumulatively over time to understand how the disease has evolved and is evolving, and to ascertain what weaknesses can be exploited for successful treatment.
Unix one-liner to convert VCF to Oncotator format
Here is a handy Unix one-liner to process mutect2 output VCF files into the 5-column, tab-separated format required as input by Oncotator (a web-based application that annotates human genomic point mutations and indels with transcripts and consequences). The output of Oncotator is a MAF-formatted file that is compatible with MutSigCV.
#!/bin/bash
# Convert mutect2 VCFs into the 5-column, tab-separated Oncotator input:
# chr, start, end, ref_allele, alt_allele
FILES='*.vcf.gz'
for file in $FILES
do
zcat "$file" | grep -v "GL000" | grep -v "FILTER" | grep "PASS" | cut -d$'\t' -f 1-5 | awk 'BEGIN{OFS="\t"} {$3=$2; $1="chr"$1; print}' > "$file.tsv"
done
Breaking this down we have:
"zcat $file" : stream each line of the gzipped file to stdout
"grep -v GL000" : exclude variants on unplaced GL000* contigs, keeping only variants on named chromosomes
"grep -v FILTER" : exclude header lines, which contain the string FILTER
"grep PASS" : keep only lines that pass the mutect2 filters
"cut -d$'\t' -f 1-5" : cut on tabs and keep fields one through five (CHROM, POS, ID, REF, ALT)
awk 'BEGIN{OFS="\t"} {$3=$2; $1="chr"$1; print}' : set column 3 equal to column 2 (so start and end positions are equal) and prepend "chr" to column 1 (1 becomes chr1); setting OFS keeps the output tab-separated, since awk's default output separator is a space
Variant annotation and transcript choice
Transcript choice between methods
Variant annotation methods do not all behave the same way when choosing transcripts to annotate against. Differences may arise from the algorithms' internal logic or from user-supplied criteria, and they lead to divergent annotation outcomes.
Unfortunately, incorrect annotations or disagreement in annotation outcomes can lead investigators to waste resources tracking down variants of little interest or to miss severe variants of potential clinical significance.
In this first post in a series, I'll talk briefly about differing outcomes owing to transcript choice when three popular methods (ANNOVAR, VEP, and SnpEff) are applied to a dataset of 81 million variants from the 1000 Genomes Project.
In this figure you can see that the lack of concordance owing to transcript choice affects a surprisingly large number of variants.
This disagreement is largely owing to the way intergenic variants are handled: some methods assign them to the nearest gene, while others use arbitrary categories like "unknown."
To learn more about this problem and other issues with annotators and concordance between methods, check out our recent paper on bioRxiv. In part two, I'll talk more about concordance between methods when annotators agree on transcript choice.