Beware of biological variability in your *-Seq experiments

From this excellent paper on biological variability in RNA-Seq experiments (bold highlights are mine):

“Biological variability has important implications for the design, analysis and interpretation of RNA-sequencing experiments. […] If only a few biological replicates are available, it will be impossible to estimate the level of biological variability in expression for each gene in a study. Supplementary Table 1 summarizes a large number of published RNA-sequencing studies over the past three years. In every case, except for the two studies we analyzed here, conclusions were based on a small number (n ≤ 2) of biological replicates. One goal of RNA-sequencing studies may be simply to identify and catalog expression of new or alternative transcripts. However, all of these studies make broader biological statements on the basis of a very small set of biological replicates.

Our analysis has two important implications for studies performed with a small number of biological replicates. First, significant results in these studies may be due to biological variation and may not be reproducible; and second, it is impossible to know whether expression patterns are specific to the individuals in the study or are a characteristic of the study populations. These ideas are now widely accepted for DNA microarray experiments, where a large number of biological replicates are now required to justify scientific conclusions. Our analysis suggests that as biological variability is a fundamental characteristic of gene expression, sequencing experiments should be subject to similar requirements.”

If you are doing RNA-Seq, be vigilant in your experimental design and find a way to incorporate more replicates, even at the expense of testing fewer comparisons. It is better to test one comparison (tissue X vs. Y, for example) with 5 or more replicates than to test three comparisons (tissue X vs. Y, Y vs. Z, and X vs. Z) with only 2 replicates per tissue type.

 

Hands-on with cancer mutational signatures, part 2

In this second part of the “Hands On” series, I want to show how to create the input for the MATLAB mutational signature framework from the output of my python code that prepares the SNP data for analysis.

First, you need to create a MATLAB .mat file as input to the program. The code expects an input file containing a set of mutational catalogues along with metadata about the cancer type and the mutational types and subtypes represented in the data.

Fig 1. The required data types within one .mat file to run the framework.

As you can see from Fig 1, you need to provide a 96 by X matrix, where X is the number of samples in your mutational catalogue. You also need an X by 1 cell array specifying sample names, a 96 by 1 cell array specifying the subtypes (ACA, ACC, ACG, etc…) and a 96 by 1 cell array specifying the types (C>A, C>A, C>A, etc…). The sample names must follow the column order, and the types and subtypes the row order, of the “originalGenomes” matrix, or the results won’t make sense.

My code outputs .csv files for all of these needed inputs. For example, when you run my python code on your SNP list, you will get “subtypes.csv”, “types.csv” and “samples.csv” files, plus a “samples_by_counts.csv” matrix (i.e., originalGenomes), corresponding to the cell arrays and matrix shown in Fig 1.
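
Before moving to MATLAB, a quick sanity check of the main output in Python (a sketch, assuming the file layout described above) can save some head-scratching later:

import pandas as pd

counts = pd.read_csv("samples_by_counts.csv", header=None)  # the originalGenomes matrix
print(counts.shape)  # should be 96 x (number of samples)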

Now the challenge is to get those CSV files into MATLAB. You should have MATLAB downloaded and installed on your PC. Open MATLAB and select “Import Data.”

Fig 2. Select the “Import Data” button.

Browse to one of the output CSV files and select it.  It will open in a new window like in Fig 3 below:

Fig 3. The data import window from MatLab.

Be sure to select the correct data type in the “imported data” section.  Also, select only row 1 for import (row 2 is empty).  Once you’re finished, click Import Selection.  It will create a 1×96 cell called “types.”  It looks like this:

Fig 4. The new imported cell data “types.”

We’re almost done, but we have to switch the cell to be 96×1 rather than 1×96.  To do this, just double-click it and select “transpose” in the variable editor.   Now you should repeat this process for the other CSV input files, being sure to select “matrix” as the data type for the “samples_by_counts” file.   Pay special attention to make sure the dimensions and data types are correct.
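
If you would rather script this step than repeat the GUI import for every file, the same structure can in principle be written straight from Python with scipy.io.savemat. This is only a sketch: apart from “originalGenomes”, the variable names below are placeholders that you should match to the names shown in Fig 1, and the output file name is arbitrary.

import numpy as np
import pandas as pd
from scipy.io import savemat

counts   = pd.read_csv("samples_by_counts.csv", header=None).values   # 96 x n_samples
types    = pd.read_csv("types.csv", header=None).values.ravel()
subtypes = pd.read_csv("subtypes.csv", header=None).values.ravel()
samples  = pd.read_csv("samples.csv", header=None).values.ravel()

# numpy object arrays are written out as MATLAB cell arrays by savemat
savemat("my_catalogue.mat", {
    "originalGenomes": counts.astype(float),                          # 96 x n matrix
    "types":       np.array(types,    dtype=object).reshape(-1, 1),   # 96 x 1 cell array
    "subtypes":    np.array(subtypes, dtype=object).reshape(-1, 1),   # 96 x 1 cell array
    "sampleNames": np.array(samples,  dtype=object).reshape(-1, 1),   # placeholder name
    "cancerType":  "my_cancer_type",                                  # placeholder metadata
})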

Once you have everything in place, you should be ready to run the mutational analysis framework from the paper. To do this, open the “example2.m” MATLAB script included with the download. In the “Define parameters” section, change the file paths to point to the .mat file you just created:

Fig 5. Define your parameters for the signature analysis.

 

As you can see in Fig 5, I’ve specified 100 iterations per core, a range of 1 to 8 possible signatures, and the relative paths to the input and output files. The authors say that ~1000 iterations are necessary for accuracy, but I’ve found little difference in the predictions between 50 and 500 iterations. Even so, I would perform as many iterations as time and computational power allow.

Note also that you may need to change the part of the code that sets up the parallel computing pool. Since MATLAB 2014, I believe, the “matlabpool” command has been deprecated. Substituting the “parpool” command works fine for me (Mac OS X 10.10, 8-core MacBook Pro Retina), as follows:

if isempty(gcp('nocreate')) % check whether a parallel pool is already open
    parpool('local');       % if not, open the default local pool (replaces the deprecated matlabpool)
end

This post is getting lengthy, so I will stop here and post one more part later about how to compare the signatures you calculate with the COSMIC database using the cosine similarity metric.

Create a volcano plot on EBSeq output

The differential expression analysis program EBSeq produces a number of data objects as part of the workflow, but there aren’t many options for visualization of the data.

The authors suggest the use of heatmap.2 in R:

heatmap.2(NormalizedMatrix[GenesOfInterest,], scale="row", trace="none", Colv=F) # heatmap.2() is provided by the gplots package

However, this depends on knowing your genes of interest ahead of time, and it is not practical to generate a heatmap with hundreds or thousands of DE genes.

I wanted to produce something approximating a volcano plot for EBSeq results.  What I came up with initially was the following:

A pseudo-volcano plot for EBSeq results. The y-axis is posterior probability of differential expression (PPDE).

To make this plot, I had to grab some data arrays from the large “EBOut” object that is produced by calling the “EBTest” function, and from the “GeneFC” object, as below:

plot(GeneFC$PostFC, EBOut$PPDE,
     xlim=c(0,5), ylim=c(0,1),
     main="Control/Experimental FC vs. PPDE",
     sub=GeneFC$Direction,
     xlab="EBSeq Posterior Fold Change", ylab="EBSeq posterior prob of DE")

abline(h=0.95)

The “abline” command places a horizontal line at PPDE = 0.95; genes above this line have a posterior probability of differential expression of at least 95%, which corresponds to a target FDR of 0.05.

If you want to inspect the plot interactively in R to identify gene names above the threshold and/or with large posterior fold changes you would use:

identify(GeneFC$PostFC, EBOut$PPDE, labels=names(GeneFC$PostFC))

To make it look more like a canonical volcano plot, I then tried:

plot(log2(GeneFC$RealFC), EBOut$PPDE,
     xlim=c(-5,5), ylim=c(0,1),
     main="Log2FoldChange vs. PPDE",
     xlab="EBSeq Log2 Fold Change", ylab="EBSeq PPDE")

This creates the following plot:

Fig 2. Log(2) Fold Change on the X-axis.

This is good, except I want to subset the data and add colors. To do this I need to create a new dataframe from the EBOut$PPDE and GeneFC$PostFC objects:

volc_df = data.frame(PPDE = EBOut$PPDE, PostFC = GeneFC$PostFC) # name the columns so "PPDE" and "PostFC" can be used directly in the with() calls below

With everything in one dataframe, plotting and subsetting the data is easier. Inspired by this post at Stephen Turner’s “Getting Genetics Done” blog, I prepared my final colored volcano plot as follows:

with(volc_df, plot(log2(PostFC), PPDE, pch=20, main="Volcano Plot EBSeq", xlim=c(-5,5)))

abline(h=0.95)

with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) < 2), points(log2(PostFC), PPDE, pch=20, col="orange"))

with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) >= 2), points(log2(PostFC), PPDE, pch=20, col="red"))

The final plot looks like this:

Fig 3. The final plot, with colored points for the DE genes, and higher fold change indicated with red.

Book Review: Python for Data Analysis


Introduction

The book “Python for Data Analysis” (O’Reilly Media, 2013) by Wes McKinney is a guide to using the NumPy, matplotlib, and pandas Python libraries for data analysis. The author sets out to provide a template for Python programmers to gain working knowledge of the rapidly maturing Python technologies for data analysis and visualization. The tone of the book is conversational and focused, with no fluff or filler, and it accomplishes its purpose admirably by providing a concise, meaty, and highly readable tutorial through the essential features of doing data analysis in Python.

McKinney does a skillful job of bringing the Python novice through the requisite background and quickly up to speed doing useful work with pandas without becoming bogged down in introductory Python minutia. In fact, the opening chapter is titled “Introductory Examples” and includes several relatively complex data analysis examples that serve to demonstrate the capabilities of pandas. I found this approach provided me with the motivation to read on into the more detailed and technical chapters.

Why you should listen to Wes McKinney

The author is uniquely suited to write this book, having created pandas and served as its first developer in the course of his own work as a quantitative analyst at a hedge fund back in 2008. I could tell that the author has a mastery of the subject; he provides many useful insights that could only be gained through real-world experience. The book focuses mainly on the pandas library and its two core data structures, the Series and the DataFrame. Both are important because they build on the speed and precision of NumPy arrays while allowing richer, more intuitive and powerful manipulation of data tables.
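
As a quick illustration of these two structures (the values here are made up):

import pandas as pd

s = pd.Series([0.25, 0.50, 0.75], index=["a", "b", "c"])  # a labeled, NumPy-backed 1-D array

df = pd.DataFrame({
    "gene":   ["TP53", "BRCA1", "EGFR"],
    "counts": [120, 85, 430],
})
print(df[df["counts"] > 100])  # label-aware, vectorized row filtering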

pandas: it just works the way it should

Another aspect of this book that is so enjoyable is that pandas itself just works the way I would expect it to work. The tools, in my opinion, are constructed to be as convenient and intuitive as possible. I find that pandas behaves very predictably, despite being extremely powerful. Oftentimes, I was able to invent an expression in pandas that behaved exactly as I intended without knowing a priori whether it was possible to do so. There is something very satisfying about a tool that just works and doesn’t require a lot of boilerplate code.

The publisher also provides downloadable IPython notebooks containing the code examples for each chapter. Using these notebooks, it was very easy to follow along, running the code while reading each chapter. The illustrations in the book also consist almost entirely of matplotlib plots prepared using the code examples, and I was able to reproduce many of the figures myself, which gave me a sense of having gained practical, working knowledge from each chapter.

Python for data analysis? Yes!

I really have nothing negative to say about “Python for Data Analysis”. If forced to find something to change, it would be that the author could have left out the highly-condensed chapter on introductory Python programming found at the end of the book, using the extra space instead to include even more examples of pandas in practical, real-world applications.

For instance, an example on building a data analysis model with interactive graphics for the web would have been welcome. Similarly, a demonstration of approaches for making matplotlib, with its rather utilitarian graphics, more closely resemble the stylistically attractive plots of ggplot2 (the well-known R plotting library) would also have been useful.

After reading this book, however, I have been convinced to transition my data analysis workflow entirely into Python and largely abandon R, which now seems somewhat esoteric and unnecessarily complex by comparison. Overall, I would highly recommend this book to anyone seeking to learn how to use Python for data analysis. It is a valuable reference for scientists, engineers, data analysts, and others who want to leverage the power of Python (and specifically numpy and pandas) for dealing with their data.

The Python ecosystem for beginners, part 2

Welcome to Part 2 of my post on the scientific Python ecosystem (Part 1 is here).   I will describe a few more of the most common and useful libraries that make up the typical Python scientific computing stack.   This is not an exhaustive list by any means, and new libraries are being continually developed by the open source community.

Matplotlib – high quality 2D and 3D plotting

A 3D plot rendered with Matplotlib.

Matplotlib is a plotting library that aims to make it easy to produce publication quality plots.  In typical Python style, Matplotlib code can be very succinct and yet yield complete, high-quality plots.  The library can generate many types of 2D graphs: regular plots, histograms, scatterplots, pie charts, statistical plots, and contour plots, to name a few.

Matplotlib is organized in a hierarchical manner that allows the user to quickly and easily create plots using high-level commands, while simultaneously allowing power users to delve into the object-oriented programming layer to control minute details of individual plots, should they choose to do so.
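
For example, a complete figure takes only a few lines; this small sketch (with made-up data) drops down to the object-oriented Figure/Axes layer:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()               # Figure and Axes objects from the OO layer
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
fig.savefig("sine.png", dpi=300)       # write a high-resolution file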

Traits – interactive class instances and GUI building

Traits is a powerful package that extends Python attributes in interesting and useful ways. For instance, attributes of a Python class can be declared as “traits” that support initialization (setting a default value), notification (telling another part of the program that a value has changed) and visualization (responding to GUI inputs). Although it is possible to achieve much of this with ordinary Python properties, Traits removes a lot of the boilerplate code and streamlines the process.
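
A minimal sketch, assuming Enthought’s traits package (the class and attribute names are made up):

from traits.api import HasTraits, Float, Str

class Particle(HasTraits):
    name = Str("particle-1")   # initialization: a default value
    mass = Float(1.0)

    def _mass_changed(self, old, new):
        # notification: called automatically whenever mass is assigned
        print("mass changed from", old, "to", new)

p = Particle()
p.mass = 2.5                   # triggers the notification handler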

Chaco – interactive 2D plotting

Chaco is a plotting application toolkit for building rich, interactive plots.   Chaco works with Traits to build object-oriented models of plots that can accept and react to inputs from the GUI.

Cython – speed up your code with C

The easiest way to think about Cython is to imagine it as a superset of the Python language. That is, all of the normal Python language is there, along with additional commands that allow code to call back and forth to C/C++ libraries seamlessly. In Cython, you can also add static type declarations to Python functions to get C-level speedups in computation. Cython code is compiled into C code for execution. Unlike weave, which allows inline C code but requires that the Python code be re-compiled during every execution, Cython code is compiled only once (unless there are later changes), meaning that an end user does not need to bother with recompiling to run the code as a standalone program.

Using Cython for numerical computation in Python, speedups of 2000X or more above the pure Python equivalent are not uncommon.

SciKit Learn – machine learning in Python

SciKit Learn is a machine-learning library for Python. It is based on NumPy, SciPy and matplotlib. There are many algorithms available for performing machine-learning tasks, falling into four main areas: classification, clustering, regression, and dimensionality reduction (e.g., principal component analysis).
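
A minimal sketch of the typical estimator workflow (fit, then predict or score), using the bundled iris dataset with a recent scikit-learn; the choice of classifier is just an example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)                               # learn from the training split
print("held-out accuracy:", clf.score(X_test, y_test))  # evaluate on unseen data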