mixOmics: a “Swiss Army knife” for -omics integration

Introduction

Genomics is the study of an organism’s complete set of genetic material, including its DNA sequence, genes, and regulation of gene expression. Other “omics” techniques, such as proteomics and metabolomics, focus on the study of proteins and metabolites, respectively. By analyzing these different types of data together, researchers can generate new insights into the inner workings of an organism and how it responds to its environment.

For example, by combining genomics data with proteomics and metabolomics data, researchers can gain a more complete understanding of an organism’s gene expression, protein production, and metabolic processes, and of how these processes work together to maintain health or, when dysregulated, to produce disease. This knowledge can provide valuable insights for a wide range of applications, including drug development, disease diagnosis, and environmental monitoring.

Finding correlations between related datasets means looking for consistent patterns or relationships across different sets of measurements. This can provide valuable insights into the underlying biological processes and functions of an organism. For example, if two datasets show a strong positive correlation, it suggests that they are related in some way and that changes in one may be associated with changes in the other. By identifying these correlations, researchers can better understand the mechanisms behind biological processes and how they are regulated. This can be useful for a variety of applications, such as predicting the effects of potential drugs or identifying targets for medical intervention.

I have surveyed the literature for tools to integrate multiple -omics datasets. As is the case for any task in bioinformatics, there are dozens of options. However, when I considered criteria such as ease of installation, documentation quality, a robust user community, user support, and published analyses, I concluded that the “mixOmics” package (available for download and installation from Bioconductor) is one of the best tools out there for this type of integration analysis.

The mixOmics approach

The mixOmics package encompasses many variants of multivariate algorithms for integrating multiple datasets. Multivariate analysis is well suited to this problem space, where there are far more features than samples. By reducing the dimensionality of the data, the analysis makes it easier for a human analyst to see patterns and interpret correlations. One of the most common algorithms in mixOmics for doing this is called “partial least squares.”

Fig 1. An overview of the mixOmics package. The methods can handle single ‘omics, multiple ‘omics on the same samples (N-integration), and the same ‘omics on multiple sets of samples (P-integration) to find correlations in the data. Some examples of the graphic outputs are shown in the bottom two panels of the figure.

The partial least squares (PLS) method is a mathematical technique used to analyze relationships between two or more datasets. It works by identifying the underlying patterns and correlations in the data, and then using this information to construct a set of “composite” variables that capture the most important features of the data (this is analogous to PCA, but differs by maximizing the correlation/covariance between latent variables of the two datasets rather than the variance within a single dataset).

These composite (latent) variables can then be used to make predictions or draw conclusions about the relationships between the datasets. For example, if two datasets are known to be related in some way, the PLS method can be used to identify the specific features of each dataset that are most strongly correlated with the other, and then construct composite variables from those features. PLS handles highly correlated features more gracefully than ordinary regression and, unlike PCA, can be used to make predictions relating the dependent and independent variables.
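To make this concrete, here is a minimal sketch of a two-block PLS in Python using scikit-learn’s PLSCanonical (mixOmics itself is an R/Bioconductor package; the simulated data, feature counts, and number of components below are invented purely for illustration):

    # Minimal PLS sketch: two omics blocks measured on the same samples,
    # reduced to a few pairs of latent variables with maximal covariance.
    import numpy as np
    from sklearn.cross_decomposition import PLSCanonical

    rng = np.random.default_rng(0)
    n = 30                                       # few samples, many features
    shared = rng.normal(size=(n, 2))             # hidden factors driving both blocks
    X = shared @ rng.normal(size=(2, 200)) + rng.normal(scale=0.5, size=(n, 200))
    Y = shared @ rng.normal(size=(2, 50)) + rng.normal(scale=0.5, size=(n, 50))

    pls = PLSCanonical(n_components=2)           # two pairs of latent variables
    X_scores, Y_scores = pls.fit_transform(X, Y)

    for c in range(2):
        r = np.corrcoef(X_scores[:, c], Y_scores[:, c])[0, 1]
        print(f"component {c + 1}: correlation between latent variables = {r:.2f}")

The mixOmics functions work on the same principle, but add the plotting and cross-validation machinery that makes the results easy to interpret.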

mixOmics takes the PLS method a step further with a feature-selection option called “sparse PLS” (sPLS), which uses “lasso” penalization to remove uninformative features from the final model, aiding interpretation and reducing computational time. Lasso regression works by adding a regularization term to the ordinary least squares model that penalizes model complexity. This L1 penalty, the “lasso,” forces the coefficients of the less important predictors to exactly zero, effectively eliminating them from the model.

This results in a simpler and more interpretable model that is better able to make accurate predictions. Lasso regression is particularly useful for datasets with a large number of predictors, as it can help to identify the most important predictors and reduce the risk of overfitting the model.
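As a rough illustration of that shrinkage behavior, here is a small Python sketch using scikit-learn’s Lasso on made-up data (the penalty strength and data are arbitrary; in mixOmics the same L1 idea is applied inside the PLS model rather than in a plain regression):

    # Lasso sketch: with an L1 penalty, most coefficients are driven exactly
    # to zero, leaving a small set of selected features.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n_samples, n_features = 40, 500
    X = rng.normal(size=(n_samples, n_features))
    true_coef = np.zeros(n_features)
    true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]   # only 5 features actually matter
    y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

    model = Lasso(alpha=0.3).fit(X, y)            # alpha sets the penalty strength
    selected = np.flatnonzero(model.coef_)
    print(f"{selected.size} of {n_features} features kept:", selected)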

Conclusion

In future posts, I will describe in more detail what can be done with mixOmics and show some results from our own studies, which have produced stunningly detailed and intricate correlation networks. If you are interested in this kind of work, I would encourage you to check out mixOmics as a possible avenue for analysis. There are other packages, and many of them are excellent, but the learning curve of mixOmics is quite gentle and it is well supported by a dynamic and active user community. It is also very flexible across experimental scenarios, so you can analyze your data several different ways while using the same package and R script.

Matataki, a new ultrafast read mapper: hope or hype?

tl;dr: It’s only 2X faster than kallisto; a marginal, rather than “ultra,” improvement. If you’re already using kallisto in your pipelines, there’s no reason to switch. Now onto the blog post:

Another RNA-seq read aligner

In July, a paper appeared in BMC Bioinformatics: “Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data.”   The title seems to imply that Matataki is much faster than anything else out there.  After all, what does ‘ultrafast’ mean unless it’s in comparison to the next fastest method?  The authors claim it also solves a unique problem of large-scale RNA-seq reanalysis that can’t be solved any other way.  Is this true?  Let’s dive into the details and find out.

The authors point out in the introduction that the amount of publicly available RNA-seq data is increasing rapidly and that many researchers would like to do large-scale reanalysis of these datasets. Reanalysis (including realignment) is necessary because of the large batch effects introduced when different alignment and quantification methods are mixed across datasets.

Although these alignment-free methods such as Sailfish, Kallisto, and RNA-Skim are much faster than the alignment-based methods, the recent accumulation of large-scale sequence data requires development of an even faster method for data management and reanalysis.

Counting unique k-mers

The method itself eschews traditional alignment for a k-mer-based read-counting approach. This approach has two steps: build an index, then quantify the reads against it. Building the index requires the algorithm to find all k-mers that are unique to a gene and also present in all isoforms of that gene (to avoid effects of differential isoform expression). Quantification proceeds by matching the k-mers from a read against the unique k-mers in the index. If several k-mers (with a minimum cutoff set by the user) match only one indexed gene, the read is assigned to that gene. If the k-mers match more than one gene, the read is discarded.
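Here is a toy Python sketch of that indexing idea (this is not Matataki’s implementation; the gene names, sequences, and the tiny k value are invented to keep the example readable):

    # Toy index build: keep only k-mers that occur in every isoform of a gene
    # and in no other gene.
    from collections import defaultdict

    K = 5  # Matataki uses much longer k-mers; 5 keeps this example readable

    def kmers(seq, k=K):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    # hypothetical genes, each with one or more isoform sequences
    genes = {
        "geneA": ["ACGTACGTGGA", "ACGTACGTGGATTC"],
        "geneB": ["TTGACCAGTACG", "TTGACCAGT"],
    }

    # k-mers shared by all isoforms of each gene
    per_gene = {g: set.intersection(*(kmers(iso) for iso in isos))
                for g, isos in genes.items()}

    # discard any k-mer that appears in more than one gene
    owner = defaultdict(set)
    for g, ks in per_gene.items():
        for kmer in ks:
            owner[kmer].add(g)
    index = {kmer: next(iter(gs)) for kmer, gs in owner.items() if len(gs) == 1}
    print(f"{len(index)} unique k-mers indexed")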

If this sounds a lot like how kallisto operates under the hood, I would agree with you as would the authors:

Similar to Kallisto, our method uses k-mers that appear only once in a gene and quantifies expression from the number of unique k-mers.

Of course, kallisto creates a transcriptome de Bruijn graph (tDBG) and uses compatibility classes to decide whether to skip k-mers while hashing a read. Matataki does not use a tDBG and simply skips along a read at a fixed interval set by the user. The code also skips the expectation–maximization (EM) step found in kallisto and focuses only on gene-level quantification. These two novel approaches are what set Matataki apart from other pseudo-alignment algorithms, according to the authors.
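Continuing in the same toy spirit, here is a sketch of the quantification step with a fixed skip interval (again, not Matataki’s code; the tiny index, parameter names, and cutoffs are all invented for illustration):

    # Toy quantification: sample every `step`-th k-mer from the read, look it
    # up in the unique-k-mer index, and assign the read only if all hits
    # (at least `min_hits` of them) point to a single gene.
    from collections import defaultdict

    K = 5
    # a tiny hand-made unique-k-mer index standing in for the real one
    index = {"ACGTA": "geneA", "GTGGA": "geneA",
             "TTGAC": "geneB", "AGTAC": "geneB"}

    def assign_read(read, index, k=K, step=2, min_hits=2):
        hits = [index[read[i:i + k]]
                for i in range(0, len(read) - k + 1, step)   # fixed-interval skipping
                if read[i:i + k] in index]
        if len(hits) >= min_hits and len(set(hits)) == 1:
            return hits[0]        # unambiguous: count the read for this gene
        return None               # ambiguous or too few hits: discard the read

    counts = defaultdict(int)
    for read in ["ACGTACGTGGATT", "TTGACCAGTACGA", "GGGGGGGGGGGGG"]:
        gene = assign_read(read, index)
        if gene is not None:
            counts[gene] += 1
    print(dict(counts))           # {'geneA': 1, 'geneB': 1}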

But does it work?

Well, it appears to. The authors start by reporting the performance of the method on simulated data using optimized parameters (I’ll discuss this below). In Figure 2, they describe the parameter tuning that led to those choices. To tune Matataki’s parameters on real data, the TPM values from an eXpress run on sample ERR188125 were taken as “ground truth.” The choice of eXpress as the benchmark for real-world quantification is odd, given that eXpress has been defunct for some time. It also takes orders of magnitude longer to run than a method like Sailfish or kallisto.

Fig 2. Comparison of TPM when k was varied. The x-axis shows the TPM values of eXpress, the y-axis shows the TPM values of our method, and the color indicates the indexed k-mer coverage of each gene when changing k from 16 to 56 with a step of 8

Going back to Figure 1 (below) and the performance on simulated data, we can see accuracy similar to that of kallisto, RSEM, Sailfish, and other methods (note that by using a scale from 0.93 to 0.99, the figure exaggerates the differences between methods). The one simulated sample was created by RSEM from the real sample plus three other related samples, so perhaps it’s not surprising that RSEM does best in this comparison overall.

I think it’s important to remember as well that Matataki is being run on this simulated data with parameters optimized for this sample, while the other methods are run with “default” parameters. We don’t know how Figures 1a and 1b would change for other samples, other organisms, or other sequencing types (PE vs. SE, long vs. short reads, etc.). It would be enlightening to find out.

Figure 1. Summary of the results using simulation data. a Spearman correlation coefficient with the expected expression and estimated expression values using each method. “Matataki” indicates the results of the proposed method, and “MatatakiSubset” indicates the results of the proposed method without uncovered genes. To compare the gene-level expression profile and transcript-level expression profile, the sum of TPM by each gene was calculated. b Means of absolute difference from the expected expression levels.

So how fast is ‘ultrafast’ anyway?

We finally get to the promised speed improvements near the end of the manuscript, only to find that the speedup relative to kallisto, a method that has been publicly available since at least 2015, is 2-fold. Yep, that’s it. It’s only twice as fast. This is an incremental improvement, not a paradigm shift. If anything, it underscores how the leap from alignment to pseudo-alignment really was a paradigm shift when kallisto first came out. It has been three years, and nothing has substantially improved on it.

Fig 3. CPU time comparison.

Limitations?

Finally, the authors end the paper with a section on ‘expected use-cases and limitations’ where they say that the method is only suitable for large-scale reanalysis and not for normal, turn-the-crank RNA-seq.

Nevertheless, Matataki is not suitable for common RNA-Seq purposes because other methods are sufficiently fast and provide better accuracy. For example, a single nucleotide substitution has larger effects in Matataki than in other methods, because even a single point substitution changes the k-mer for 2 k − 1 bases, which ultimately affects the number of k-mers in a transcript and calculation of the TPM value.
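The arithmetic behind that claim is easy to check with a toy example in Python (the sequence and k value below are invented): a single substitution sits inside up to k overlapping k-mers, which together span 2k − 1 bases of the transcript.

    # Count how many k-mers change when one base of a transcript is substituted.
    k = 8
    ref = "ACGT" * 10                    # a toy 40-base transcript
    alt = ref[:20] + "T" + ref[21:]      # a single substitution at position 20

    def kmer_list(seq, k):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    changed = sum(a != b for a, b in zip(kmer_list(ref, k), kmer_list(alt, k)))
    print(f"{changed} of {len(kmer_list(ref, k))} k-mers change")   # 8 of 33
    print(f"bases spanned by the affected k-mers: {2 * k - 1}")     # 15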

For reasons that I don’t understand, the authors promote this tool (which is only 2X faster than kallisto) as being for large-scale reanalysis, yet they performed no large-scale reanalysis to benchmark it. Rather, they benchmarked it on ordinary RNA-seq, the very use case they go on to advise against.

To sum up, while a 2X speedup is useful, I see no reason to abandon kallisto for pseudoalignment and quantification when doing large-scale reanalysis. Unlike Matataki, kallisto is robust to read errors, and it delivers transcript-level estimates that can later be summed to the gene level if you so desire. It also provides bootstrapped estimates of technical variation, which helps to separate technical from biological variation downstream. And finally, kallisto integrates nicely with sleuth for DE testing and visualization.

Breakthrough advances in 2018 so far: flu, germs, and cancer

2018 medicine breakthrough review!

So far, this year has seen some pretty important research breakthroughs in several key areas of health and medicine. I want to briefly describe some of what we’ve seen in just the first few months of 2018.

Flu

A pharmaceutical company in Japan has released phase 3 trial results showing that its drug, Xofluza, can effectively kill the virus in just 24 hours in infected humans. And it can do this with a single dose, compared with the 10-dose, three-day regimen of Tamiflu. The drug works by inhibiting an endonuclease needed for replication of the virus.

Germs

It is common knowledge that antibiotics are over-prescribed and over-used. This has led to the rise of MRSA and other resistant bacteria that threaten human health. Although bacteria are thought to be a promising source of novel antibiotics, since they are in constant chemical warfare with one another, most bacteria aren’t easy to culture in the lab, so researchers haven’t been looking to them for leads. Until now.

Malacidin drugs kill multidrug-resistant S. aureus in tests on rats.

By applying whole-genome sequencing approaches to soil bacterial diversity, researchers were able to screen for gene clusters associated with calcium-binding motifs known for antibiotic activity. The result was the discovery of a novel class of lipopeptides, called malacidins A and B, which showed potent activity against MRSA in a rat skin-infection model.

The researchers estimate that 99% of bacterial natural-product antibiotic compounds remain unexplored at present.

Cancer

2017 and 2018 have seen some major advances in cancer treatment. It seems that the field is moving away from a focus on small-molecule drugs and toward harnessing the patient’s own immune system to attack cancer. The CAR-T therapies for pediatric leukemia appear extremely promising, and these kinds of therapies are now in trials for a wide range of blood and solid tumors.

A great summary of the advances being made is available here from the Fred Hutchinson Cancer Research Center.   Here is how Dr. Gilliland, President of Fred Hutch, begins his review of the advances:

I’ve gone on record to say that by 2025, cancer researchers will have developed curative therapeutic approaches for most if not all cancers.

I took some flak for putting that stake in the ground. But we in the cancer research field are making incredible strides toward better and safer, potentially curative treatments for cancer, and I’m excited for what’s next. I believe that we must set a high bar, execute and implement — that there should be no excuses for not advancing the field at that pace.

This is a stunning statement on its own, but it is made even more so because it is usually the scientists in the day-to-day trenches of research who are the most pessimistic about the possibility of rapid advances.

Additionally, an important paper came out recently proposing a novel paradigm for understanding and modeling cancer incidence with age. For a long time, the dominant model has been the “two-hit” hypothesis, which predicts that clinically observable cancers arise once a cell acquires sufficient mutations in tumor-suppressor genes to become a tumor.

This paper challenges that notion and shows that a model of declining thymic function (the thymus produces T cells) over time better describes the incidence of cancers with age. The thymic model fits the data better and leads to the conclusion that cancers are continually arising in our bodies, but that a properly functioning immune system roots them out and prevents clinical disease from emerging. This model also helps explain why novel cancer immunotherapies are so potent and why the focus has shifted to supporting and activating T cells.

Declining T cell production leads to increasing disease incidence with age.


Genomic landscape of metastatic cancer

Integrative genomics sheds new light on metastatic cancer

A new study from the University of Michigan Comprehensive Cancer Center has just been released that takes an in-depth look at the genomics of metastatic cancer, as opposed to primary tumors. This work involved DNA- and RNA-Seq of solid metastatic tumors from 500 adult patients, as well as sequencing of matched normal tissue to distinguish somatic from germline variants.

tl;dr:

A good overview of the study, written for the scientific layperson, can be found in this press release. It summarizes the key findings (many of which are striking and novel):

  • A significant increase in mutational burden of metastatic tumors vs. primary tumors.
  • A long-tailed distribution of mutational frequencies (i.e., a few genes were mutated at a high rate, while many more were mutated at low rates).
  • About twelve percent of patients harbored germline variants that are suspected to predispose to cancer and metastasis, and 75% of those variants were in DNA repair pathways.
  • Across the cohort, 37% of patient tumors harbored gene fusions that either drove metastasis or suppressed the cells’ anti-tumor functions.
  • RNA-Seq showed that metastatic tumors are significantly de-differentiated and fall into two classes: proliferative and EMT-like (epithelial-to-mesenchymal transition).

A brief look at the data

This study provides a high-level view of the mutational burden of metastatic cancer vis-à-vis primary tumors. Figure 1C from the paper compares mutation rates across tumor types between the TCGA (The Cancer Genome Atlas) primary tumors and the MET500 metastatic cohort.

Mutational burden in metastatic cancer compared to primary tumors.


Here we can see that in most cases (colored bars), metastatic cancers had statistically significant increases in mutation rate. The figure also shows that tumor types with low mutation rates in the primary setting “sped up” the most, compared with those primary tumor types that already had high rates.

Supplemental Figure 1d (below) shows how often key tumor suppressors and oncogenes are altered in metastatic cancer vs. primary tumors. TP53 is altered more frequently in metastatic thyroid, colon, lung, prostate, breast, and bladder cancers. PTEN is mutated more in prostate tumors. GNAS and PIK3CA are mutated more in thymoma, although in this case the finding doesn’t reach significance. KRAS is altered more in colon and esophageal cancers, but again, these findings don’t reach significance after correction for multiple testing.

Comparison of genetic alteration frequencies in metastatic and primary tumors.


One other figure I’d like to highlight briefly is Figure 3C from the paper, shown below:

Molecular structure of novel, potentially activating gene fusions in the metastatic tumors.

I wanted to mention this figure to illustrate the terrifying complexity of cancer. Knowing which oncogenes are mutated, in which positions, and the effects of those mutations on gene-expression networks is not enough to understand tumor evolution and metastasis. There are also new genes being created that do totally new things, and these are unique on a per-tumor basis. None of the structures above had ever been observed before, and yet they were all seen in a survey of just 500 cancers. In fact, ~40% of the tumors in the study cohort harbored at least one fusion suspected to be pathogenic.

There is much more to this work, but I will leave it to interested readers to go read the entire study. I think this work is tremendously important and novel, and it represents the future of personalized medicine: a patient undergoing treatment for cancer will have their tumor or tumors biopsied and sequenced repeatedly over time, both to understand how the disease has evolved and is evolving, and to ascertain what weaknesses can be exploited for successful treatment.