A brief look at machine-learning powered literature search

Machine learning (ML) and neural networks are transforming data science and the life sciences. They are being applied to the challenge of making sense of piles of ‘big data’ that keep growing bigger all the time.

Now these same tools are being applied to searching the gigantic scientific literature databases (PubMed alone contains more than 30 million citations) in order to bring more relevant results to researchers.

A simple PubMed search proceeds by matching terms like the following:

…if you enter child rearing in the search box, PubMed will translate this search to: “child rearing”[MeSH Terms] OR (“child”[All Fields] AND “rearing”[All Fields]) OR “child rearing”[All Fields]

https://www.ncbi.nlm.nih.gov/books/NBK3827/
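If you want to see this translation step for yourself, NCBI exposes it through its public E-utilities API. Below is a minimal Python sketch (it assumes the third-party requests package, and my understanding that the querytranslation field of the esearch JSON response holds the expanded query) that submits a term and prints how PubMed interprets it.

import requests

# Sketch: ask NCBI's E-utilities how PubMed translates a free-text search term.
# Assumes the `requests` package is installed; the endpoint is the public esearch API.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_translation(term):
    """Return PubMed's expanded interpretation of a free-text search term."""
    params = {"db": "pubmed", "term": term, "retmode": "json"}
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    result = response.json()["esearchresult"]
    # "querytranslation" should hold the MeSH/Boolean expansion PubMed actually ran
    return result.get("querytranslation", "")

print(pubmed_query_translation("child rearing"))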

If you want something potentially more sophisticated than simply searching on matching terms, as PubMed does, take a look at the tools below. Without having used each one extensively, it’s difficult for me to tell whether the results are an improvement on PubMed or Google, but let’s jump in and explore each one briefly:

Semantic Scholar

First up is Semantic Scholar. According to its “about” page, Semantic Scholar is aimed at helping researchers find relevant publications faster. It analyzes whole documents and extracts meaningful features using various types of ML. The authors claim that this approach surfaces influential citations, key images, and key phrases, and allows the researcher to focus on the most impactful publications first. They claim to index 176M articles and to have filters for high-quality publications, although details about how these filters work are scarce.

A search results page from a Semantic Scholar search for “single-cell RNA-seq”

The search results appear to have some nice features. Above is a screencap of the results for a “single-cell RNA-seq” search. In the image below, you can see that beneath each paper title and abstract are a couple of numbers in orange. The number on the left is the count of “highly influential citations,” that is, the number of citing papers in which this paper played an important role. The number on the right is the “citation velocity,” the average number of citations per year for that work. Then there are several more useful buttons, including a link out, a button that brings up the citation in a variety of formats, a “save” button, and a button to add the paper to my Paperpile library.

Clicking through on one paper yields a page that looks like this:

A results page from Semantic Scholar. Key figures are pulled out and highlighted for quick viewing. Key topics covered in the work are shown on the right.

This nice, clean interface makes it easy to absorb the content of the paper, including browsing the abstract and key figures. There is also a metrics box in the upper right that shows how many times the paper has been cited, how many of those citing papers are “highly influenced,” and where in the citing papers this paper is referenced. The headings across the middle of the page link to the sections below: “Figures and Topics”; “Media Mentions,” where Semantic Scholar finds blog posts and online reports that mention the paper; “Citations,” a list of the citing papers; “References,” the papers referenced by this paper; and “Similar Papers,” papers that cover related topics.

Iris.ai

Iris.ai is a machine-learning tool that uses neural networks to build knowledge graphs about publications. The “about me” section includes a cutesy intro in the first person, as if the algorithm were just a really smart person reading a lot of papers rather than a research project. Anyway, Iris claims to have “read” at least 77M papers in its core database. There is a good article here detailing the evolution of Iris since its founding in 2016, and the Iris.ai blog is a good place to learn about updates to the method.

When you perform a search with Iris.ai, the interface looks like this:

Above: The search interface for Iris.ai.

This looks like a standard search bar, but instead of entering keywords you either input the URL of a paper you are interested in, or you write a title and a 300-word paragraph describing a research problem. So there is some work up front to get to useful results, but it is possibly worth it if you need to do a deep dive into the literature. Let’s take a look at those results below.

Above: Search results for the paper “CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing.”

OK, this is wild. I’ve never seen a search result like this “map” of the knowledge that results from searching a paper. In this case, I searched the “CNVkit” paper. Each “cell” in this map can be zoomed in on, revealing sub-categories that further break down the knowledge and context of the papers. Below that are the actual papers themselves.

Here I’ve zoomed in on the “Target” cell and then the “Re-sequencing” cell. Now I’m down to the individual papers that make up this cell.

I hope you’ve enjoyed this brief tour through some advanced ML-powered literature search tools. I am going to make an effort to incorporate them into my own literature searching and see what difference it makes (maybe a subject for a future post).

Conference report: GLBIO2019

I just returned from another great experience at the 2019 Great Lakes Bioinformatics conference (#GLBIO2019), a regional meeting of the International Society for Computational Biology (ISCB). Below I’ll briefly summarize a few of the talks that I personally found most interesting (there were several parallel tracks, so I did not attend all the talks).

Docker workshop taught by Sara Stevens

On the Sunday of the conference, I attended a 3-hour workshop introducing Docker technology, held in the beautiful and very modern Wisconsin Institutes for Discovery building. The course was taught by Sara Stevens, an expert in data science and bioinformatics with the Data Science Hub at the University of Wisconsin-Madison.

We worked through an initial “hello world” application of Docker on our laptops, writing a Dockerfile that became an image and finally a container instance of that image:

mchiment@MNE762:~/Desktop/docker-playground/my-greeting$ cat Dockerfile
# specify the base image
FROM alpine
# build step: write a greeting into a file inside the image
RUN /bin/echo "greeting!" > /root/my_message
# default command: print the greeting when a container runs from this image
CMD ["/bin/cat", "/root/my_message"]

Then we progressed to more complex Dockerfile builds, including one that installs a minimal Python distribution and runs a program. This involved installing some libraries with pip within the image and running a script.

Overall, I learned a lot and got a good grasp of the Docker basics to build upon for future work.

Integrative analysis for fine mapping of genetic variants, Sunduz Keles

This talk addressed the issue of how to make sense of GWAS data: if you have a collection of SNPs, how do you decide which genes to follow up on, which mechanisms to propose, and so on? The talk introduced a tool, atSNP Search, which uses transcription factor position weight matrices (PWMs) to assess the impact of a SNP on TF DNA-binding activity in the local region around the SNP.

From the website:

atSNP identifies and quantifies best DNA sequence matches to the transcription factor PWMs with both the reference and the SNP alleles in a small window around the SNP location (up to +/- 30 base pairs and considering subsequences spanning the SNP position). It evaluates statistical significance of the match scores with each allele and calculates statistical significance of the score difference between the best matches with the reference and SNP alleles.
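To make that idea concrete, here is a rough Python sketch of the core computation (my own illustration, not the actual atSNP code): score a PWM against every subsequence spanning the SNP position for the reference and alternate alleles, then compare the best matches. The toy PWM and sequences are made-up placeholders.

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_spanning_score(seq, pwm, snp_pos):
    """Best log-odds PWM score over subsequences of `seq` that span `snp_pos`."""
    width = pwm.shape[0]
    starts = range(max(0, snp_pos - width + 1), min(snp_pos, len(seq) - width) + 1)
    return max(
        sum(pwm[i, BASES[base]] for i, base in enumerate(seq[s:s + width]))
        for s in starts
    )

def snp_effect_on_match(flank5, ref, alt, flank3, pwm):
    """Change in best PWM match (alt minus ref) in a small window around the SNP."""
    snp_pos = len(flank5)
    ref_best = best_spanning_score(flank5 + ref + flank3, pwm, snp_pos)
    alt_best = best_spanning_score(flank5 + alt + flank3, pwm, snp_pos)
    return alt_best - ref_best

# Toy example: an 8-position log-odds PWM and +/- 30 bp of made-up flanking sequence
rng = np.random.default_rng(0)
toy_pwm = rng.normal(size=(8, 4))           # rows = positions, cols = A, C, G, T
flank = "ACGTGTCAAGGTCCATTAGCAACGTGTCAA"     # 30 bp, purely illustrative
delta = snp_effect_on_match(flank, "A", "G", flank[::-1], toy_pwm)
print(f"change in best PWM match score (alt - ref): {delta:.3f}")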

The talk also introduced a method, “FM-HighLD,” which asks whether functional annotations of SNPs can substitute for massively parallel reporter assays (MPRAs), which are considered the “gold standard” for assessing SNP/eQTL function. The idea is to use MPRA results and their correlation with functional annotations to calibrate the model, and then apply it to eQTLs or GWAS SNPs that have no MPRA results but do have functional annotations in public databases.

refine.bio

There is over $4 billion worth of publicly funded RNA-seq and microarray data in the public repositories. Studies have shown that analysts can spend up to 30% of a project’s time just searching, accessing, downloading, and preprocessing these data.

Refine.bio is an attempt to “harmonize” thousands of gene expression datasets by downloading and pre-processing them using a common pipeline and common reference. This is only possible owing to the innovation of pseudo-alignment in methods like kallisto and salmon.

In the background, refine.bio runs on Amazon Web Services, which gives the project compute and storage that scale with its needs. In addition to standardized gene expression processing, sample metadata are also harmonized: keywords are mapped to standard ontologies for ease of comparison.

Monitoring crude oil spills with 16S and machine-learning, Stephen Techtmann

In this work, Dr. Techtmann’s group was interested in the response of freshwater microbiomes drawn from Lake Superior to the introduction of different types of oil (a complex chemical mixture that acts as a carbon food source). The team drew lake water samples, incubated them with different oils (heavy crude, refined crude, etc.), and then assessed taxonomic abundance using 16S rRNA amplicon sequencing.

The taxa abundances were used to train a random forest model to predict oil contamination status. The random forest produced a model with very high accuracy (AUC > 0.9), and they found that two taxa predominantly distinguish the oil-amended samples from the lake water samples.
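As a rough illustration of what that modeling step looks like (this uses simulated placeholder data, not the Techtmann lab’s analysis), here is a short Python sketch that trains a random forest on a taxa-abundance table, reports a held-out AUC, and pulls out the most important taxa.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Simulated relative abundances: 120 samples x 200 taxa
X = rng.dirichlet(np.ones(200), size=120)
y = np.repeat([0, 1], 60)                 # 0 = lake water control, 1 = oil-amended
X[y == 1, :2] += 0.05                     # make the first two taxa weakly informative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.2f}")

# Which taxa drive the prediction?
top = np.argsort(rf.feature_importances_)[::-1][:2]
print("most important taxa (column indices):", top)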

Conference report: ISMB 2018 Chicago

In early July, I attended the ISMB 2018 meeting, a computational biology-focused meeting held by the International Society for Computational Biology (ISCB). The meeting was held in the beautiful Hyatt Regency hotel in downtown Chicago, just across the street from the river and blocks from Navy Pier and Lake Shore Drive.

ISMB 2018 was a huge meeting, with at least 1,500 attendees and up to ten parallel tracks (called “COSIs” in ISCB parlance) running at any one time. The meeting was so big that I always felt as if I were missing something good, no matter which talk I attended (except for the keynotes, when nothing else was happening).

Here is Steven Salzberg’s excellent keynote, a historical overview of finding genes in the human genome:

Obviously, at a meeting this large, one cannot explore more than a tiny fraction of the talks and posters (but you can now watch many of the ISMB 2018 talks on YouTube if you’re interested).

I want to briefly summarize three talks that I particularly enjoyed and found very interesting:

1) Michael Seiler, H3 Biomedicine.  “Selective small molecule modulation of splicing in cancer”

Up to 60% of hematologic neoplasms (CLL/AML/MDS) contain heterozygous hotspot mutations in key splicing factor genes involved in 3′ splice site recognition. One such gene is SF3B1, an RNA splicing factor that is recurrently mutated around its HEAT repeats. The cryo-EM structure of SF3B1 revealed that the mutations cluster in the pre-mRNA-interacting region. All of the observed mutations lead to alternate 3′ splice sites being chosen during splicing, so-called “cryptic 3′ splice sites (AG).” In particular, the K700E mutant slips the spliceosome upstream to an internal AG almost 18 nucleotides from the proper splice site.

RNA-seq of 200 CLL patients helped to discover further in-frame deletions in SF3B1, SRSF2, and U2AF1. All of these mutations are heterozygous, and the cells rely on the wild-type copy for survival.

Seiler asked whether one could exploit this weakness to attack cancer cells. The idea is to use small molecules to modulate the activity of SF3B1 and disturb the wild-type function. At H3, they took a natural product compound library and optimized for chemistry and binding to find candidates. One molecule, H3B-8800, showed promise: the compound was optimized toward selective induction of apoptosis in SF3B1-mutant cells in vitro, and when resistance mutations occur, they are all at points of contact with the drug. At a dose of only 13 nM, around 1% of splicing events were affected, which was enough to kill the cancer cells. RNA-seq showed that it was mainly the spliceosome genes themselves whose splicing was altered in the presence of the inhibitor, revealing a delicate feedback loop between correct splicing and spliceosome gene expression.

2) Curtis Huttenhower,  Harvard University.  “Methods for multi-omics in microbial community population studies”

Huttenhower’s group at Harvard is well known for its contributions to methods for microbiome analysis, collectively known as the “bioBakery.” In this talk, Huttenhower addressed the fact that the microbiome is increasingly of interest when looking at population health, and that chronic immune disease incidence has been rising around the world over the last few decades.

Huttenhower also spent a good amount of time describing the IBDMDB, the Inflammatory Bowel Disease Multi’omics Database. The database contains ~200 samples that are complete with six orthogonal data types, including RNA-seq, metagenomics, and metabolomics. He talked about how bacteria enriched in IBD can be associated with covariation in the prevalence of metabolites. For example, he showed that sphingolipids, carboximidic acids, cholesteryl esters, and more are enriched, while lactones, beta-diketones, and others are depleted, in the guts of Crohn’s disease sufferers.

3) Olga Troyanskaya, Princeton University.  “ML approaches for the data-driven study of human disease”

Troyanskaya’s group at Princeton aims to develop accurate models of the complex processes underlying cellular function. In this talk, she described efforts in her lab to use machine learning methods to understand how single nucleotide polymorphisms (SNPs) in non-coding regions can affect gene regulation and expression. She is also interested in how pathways and networks change in different tissues and whether tissue-specific maps can be used to identify disease genes.

In particular, Dr. Troyanskaya described the “DeepSEA” software, which uses a convolutional neural network (CNN) to predict the chromatin-level consequences of a SNP in its genomic context. The model is a three-layer CNN that takes 1,000 base pairs of sequence as input. The 1,000 bp region is centered on known TF binding locations (200 bp bins). The training labels consist of a length-919 vector of binary values indicating the presence or absence of a TF binding event across the genome (1 = binds, 0 = does not bind). The output of the model is the probability that the specific sequence variant will affect TF binding at each of the 919 bins.

Schematic of the DeepSEA model, showing input data, training data, and output.

The DeepSEA model can be used to predict important sequence variants and even eQTLs for further study or to prioritize a list of known non-coding sequence variants.
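To give a feel for the shape of such a model, here is a minimal PyTorch sketch of a DeepSEA-style architecture: a one-hot encoded 1,000 bp sequence in, three convolutional layers, and 919 sigmoid outputs. The layer sizes here are my guesses for illustration, not the published hyperparameters.

import torch
import torch.nn as nn

class DeepSEALike(nn.Module):
    """Sketch of a DeepSEA-style CNN (illustrative layer sizes, not the real model)."""
    def __init__(self, n_features=919, seq_len=1000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 320, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(320, 480, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(480, 960, kernel_size=8), nn.ReLU(),
        )
        # infer the flattened size from a dummy pass so the sketch stays self-contained
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, 4, seq_len)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 925), nn.ReLU(),
            nn.Linear(925, n_features), nn.Sigmoid(),  # per-feature binding probabilities
        )

    def forward(self, x):
        # x: (batch, 4, 1000) one-hot encoded DNA sequence
        return self.classifier(self.conv(x))

model = DeepSEALike()
one_hot = torch.zeros(1, 4, 1000)
one_hot[0, 0, :] = 1.0            # toy input: a poly-A sequence
probs = model(one_hot)
print(probs.shape)                # torch.Size([1, 919])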


Developers versus consumers of bioinformatics analysis tools

Life in the middle

As a bioinformatics applications scientist, I work in a middle ground between those who develop code for analyzing next-gen sequencing data and those who consume that analysis.   The developers are often people trained in computer science, mathematics, and statistics.   The consumers are often people trained in biology and medicine.    There is some overlap, of course, but if you’ll allow a broad generalization, I think the two groups (developers and consumers) are separated by large cultural differences in science.

Ensemble focus vs. single-gene focus

I think one of the biggest differences, in my experience, is the approach to conceptualizing the results of a next-gen sequencing experiment (e.g., RNA-seq). People on the methods side tend to think in terms of ensembles and distributions. They are interested in how the variation observed across all 30,000 genes can be modeled and used to estimate differential expression, and in concepts like shrinkage estimators, Bayesian priors, and hypothesis weighting. The table of differentially expressed features is thought to have meaning mainly in the statistical analysis of pathway enrichment (another ensemble).

Conversely, biologists have a drastically different view. Often they care about a single gene, or a handful of genes, and how those genes vary across the conditions of interest in the system they are studying. This is a rational response to the complexity of biological systems; no one can keep the workings of hundreds of genes in mind, and in order to make progress, you must focus. However, this narrow focus sometimes leads investigators to cherry-pick results pertaining to their ‘pet’ genes, which can invest those results with more meaning than is warranted from a single experiment.

The gene focus of biologists also leads to clashes with the ensemble focus of bioinformatics software. For example, major DE analysis packages like DESeq2 do not have convenience functions for making volcano plots, even though I’ve found those plots to be the most useful and easiest to understand for biologists. Sleuth does have a volcano plotting function, but it doesn’t allow for labeling genes. In my experience, however, biologists want a high-resolution figure with common-name gene labels (not Ensembl transcript IDs) that they can circle and consider for wet lab validation.
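Making such a figure is not a lot of code once you have a flat results table. Here is a rough Python/matplotlib sketch (the file name and the column names gene_symbol, log2FoldChange, and padj are placeholders for whatever your DE pipeline writes out); the same idea is easy to reproduce in ggplot2.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder path and column names; adapt to your pipeline's output
res = pd.read_csv("de_results.csv")
res["neglog10_padj"] = -np.log10(res["padj"])

sig = (res["padj"] < 0.05) & (res["log2FoldChange"].abs() > 1)

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(res["log2FoldChange"], res["neglog10_padj"],
           c=np.where(sig, "crimson", "grey"), s=8, alpha=0.6)

# label the most significant genes with their common names, not transcript IDs
for _, row in res[sig].nsmallest(10, "padj").iterrows():
    ax.annotate(row["gene_symbol"],
                (row["log2FoldChange"], row["neglog10_padj"]),
                fontsize=8, xytext=(2, 2), textcoords="offset points")

ax.axhline(-np.log10(0.05), ls="--", lw=0.5, color="black")
ax.axvline(-1, ls="--", lw=0.5, color="black")
ax.axvline(1, ls="--", lw=0.5, color="black")
ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10 adjusted p-value")
fig.savefig("volcano.png", dpi=300)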

New software to address the divide?

I am hopeful about the recent release of the new “bcbioRNASeq” package from the Harvard Chan School Bioinformatics Core (developers of the ‘bcbio’ RNA-seq pipeline software). It appears to be a step towards making it easier for people like me to walk on both sides of this cultural divide. The package takes all of the outputs of the ‘bcbio’ pipeline and transforms them into an accessible S4 object that can be operated on quickly and simply within R.

Most importantly, the ‘bcbioRNASeq’ package allows for improved graphics and plotting that make communication with biologists easier. For example, the new MA and volcano plots appear to be based on ‘ggplot2’ graphics and are quite pretty (and have text label options(!), not shown here):

I still need to become familiar with the ‘bcbioRNASeq’ package, but it looks quite promising.   ‘pcaExplorer’ is another R/Bioconductor package that I feel does a great job of making accessible RNA-seq reports quickly and easily. I suspect we will see continued improvements to make plots prettier and more informative, with an increasing emphasis on interactive, online plots and notebooks rather than static images.

Breakthrough advances in 2018 so far: flu, germs, and cancer

2018 medicine breakthrough review!

This year has already seen some important research breakthroughs in several key areas of health and medicine. I want to briefly describe some of what we’ve seen in just the first few months of 2018.

Flu

A pharmaceutical company in Japan has released phase 3 trial results showing that its drug, Xofluza, can effectively stop the flu virus within just 24 hours in infected humans, and it can do this with a single dose, compared to the 10-dose, three-day regimen of Tamiflu. The drug works by inhibiting an endonuclease needed for replication of the virus.

Germs

It is common knowledge that antibiotics are over-prescribed and over-used, a fact that has led to the rise of MRSA and other resistant bacteria that threaten human health. Although bacteria are thought to be a promising source of novel antibiotics, since they are in constant chemical warfare with one another, most bacteria aren’t easily cultured in the lab, and so researchers haven’t been looking to them for leads. Until now.

Malacidin drugs kill multi-drug-resistant S. aureus in tests on rats.

By applying metagenomic sequencing to soil bacterial diversity, researchers were able to screen for gene clusters associated with calcium-binding motifs known for antibiotic activity. The result was the discovery of a novel class of lipopeptides, called malacidins A and B, which showed potent activity against MRSA in skin infection models in rats.

The researchers estimate that 99% of bacterial natural-product antibiotic compounds remain unexplored at present.

Cancer

2017 and 2018 have seen some major advances with cancer treatment.   It seems that the field is moving away from the focus on small-molecule drugs towards harnessing the patient’s own immune system to attack cancer.  The CAR-T therapies for pediatric leukemia appear extremely promising.  These kinds of therapies are now in trials for a wide range of blood and solid tumors.

A great summary of the advances being made is available here from the Fred Hutchinson Cancer Research Center.   Here is how Dr. Gilliland, President of Fred Hutch, begins his review of the advances:

I’ve gone on record to say that by 2025, cancer researchers will have developed curative therapeutic approaches for most if not all cancers.

I took some flak for putting that stake in the ground. But we in the cancer research field are making incredible strides toward better and safer, potentially curative treatments for cancer, and I’m excited for what’s next. I believe that we must set a high bar, execute and implement — that there should be no excuses for not advancing the field at that pace.

This is a stunning statement on its own, but made even more so because it is usually the scientists in the day-to-day trenches of research who are themselves the most pessimistic about the possibility of rapid advances.

Additionally, an important paper came out recently proposing a novel paradigm for understanding and modeling cancer incidence with age. For a long time the dominant model has been the “two-hit” hypothesis, which predicts that clinically observable cancers arise when a cell acquires sufficient mutations in tumor-suppressor genes to become a tumor.

This paper challenges that notion and shows that a model of declining thymic function (the thymus produces T cells) over time better describes the incidence of cancers with age. This model fits the data better and leads to the conclusion that cancers are continually arising in our bodies, but a properly functioning immune system roots them out and prevents clinical disease from emerging. It also helps explain why novel cancer immunotherapies are so potent and why the focus has shifted to supporting and activating T cells.

Declining T cell production leads to increasing disease incidence with age.