Gene expression boxplots with ggplot2

The ubiquitous RNAseq analysis package, DESeq2, is a very useful and convenient way to conduct DE gene analyses.  However, it lacks some useful plotting tools.   For example, there is no convenience function in the library for making nice-looking boxplots from normalized gene expression data.

There are other packages one can rely on, for example ‘pcaExplorer’, but I like a simple approach sometimes to plot just a couple of genes.  So below I show you how to quickly plot your favorite gene using only ggplot2 (there is no “one weird trick” though…):

traf1_counts <- counts(ddsTxi['ENSG00000056558',], normalized = TRUE)
m <- list(counts = as.numeric(traf1_counts), group = as.factor(samples$group))
m <- as.tibble(m)
q <- ggplot(m, aes(group, counts)) + geom_boxplot() + geom_jitter(width = 0.1)
q <- q + labs(x = "Experimental Group", y = "Normalized Counts ", title = "Normalized Expression of TRAF1")
q <- q + scale_x_discrete(labels=c("00hn" = "PMN, 0hrs", "06hn" = "PMN, 6hrs",
                                   "06hy" = "PMN+Hp, 6hrs", "24hn" = "PMN, 24hrs", "24hy" = "PMN+Hp, 24hrs"))
q

As you can see above, first we must grab the normalized counts at the row corresponding with the Traf1 Ensembl ID using the ‘counts‘ function that operates on the ‘ddsTxi’ DESeqDataSet object.

In order to create a dataframe (well, a tibble to be specific) for plotting, we first create a list (‘m’) that combines the counts (as a numeric vector) and metadata group.  These two vectors will form the columns of the tibble for plotting, and we must give them names (i.e., “counts” and “group”) so the tibble conversion doesn’t complain.

The list, m, is then converted to a tibble with ‘as.tibble‘ and plotted with ggplot2, using an ‘aes(group,counts)‘ aesthetic plus a boxplot aesthetic.  The rest of the code is just modifying axis labels and tickmarks.  The final product looks like this:

Boxplot of normalized Traf1 expression in 5 different conditions (3 replicates each).

Importing a merged Seurat dataset into Monocle

I recently ran across a situation that I think is going to be increasingly common as I do more and more single-cell analyses.   Specifically, I had a project where the investigator had several experiments in related conditions that they want to merge and evaluate with a pseudotime analysis.   I could not find any useful tools within Monocle itself for merging data (please correct me in the comments if I’m missing something).   It looks as if you have to import a pre-merged seurat dataset.

Here is the workaround that I found [please note these commands are for Seurat v2, they will likely *not* work in v3]:

naive.data <- Read10X(data.dir = paste0(base_dir, "naive_outs/filtered_gene_bc_matrices/mm10/"))
naive <- CreateSeuratObject(raw.data = naive.data, min.cells = 3, min.genes = 200, project = "10X_naive")

lcmv.data <- Read10X(data.dir = paste0(base_dir, "lcmv_outs/filtered_gene_bc_matrices/mm10/"))
lcmv <- CreateSeuratObject(raw.data = lcmv.data, min.cells = 3, min.genes = 200, project = "10X_lcmv")

pygp.data <- Read10X(data.dir = paste0(base_dir, "pygp_outs/filtered_gene_bc_matrices/mm10/"))
pygp <- CreateSeuratObject(raw.data = pygp.data, min.cells = 3, min.genes = 200, project = "10X_pygp")

cd4.combined <- MergeSeurat(object1 = naive, object2 = lcmv, add.cell.id1 = "naive", add.cell.id2 = "lcmv")
cd4.combined <- AddSamples(object = cd4.combined, new.data = pygp.data, add.cell.id = "pygp")

cd4.combined@raw.data <- as.matrix(cd4.combined@raw.data)
cd4.monocle <- importCDS(cd4.combined, import_all = TRUE)

Here, I am reading in 10X data using Seurat (v2) w/ the Read10X function and then creating the Seurat object with CreateSeuratObject.

Once this done I use MergeSeurat to merge the first two experiments, and then AddSamples to add in the final experiment.   Then we can take advantage of the monocle function importCDS to import the combined object into monocle.

Now there is one final problem and that is that the “orig.ident” field is blank:

 

 

 

To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (i.e., “naive_AAACTGAGAAACCGA”).   We just need to split these and recover which experiment each cell came from with:

cellnames <- rownames(pData(cd4.monocle))
orig_ident <- sapply(strsplit(cellnames, split = "_"), '[', 1)
pData(cd4.monocle)$orig.ident <- orig_ident

We do a strsplit on the cellnames, splitting on underscore. The first value from the split in each case is assigned back into the ‘orig.ident’ field of the cell dataset object.

Now you’re ready to continue with the normal downstream analysis in monocle.  With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged seurat dataset):

library(cowplot)
p1 <- plot_cell_clusters(cd4.monocle, color_by = 'orig.ident')
p2 <- plot_cell_clusters(cd4.monocle) #color_by cluster is default behavior
plot_grid(p1,p2)

And we get:

The PCA clusters on the tSNE plot (left) and orig.ident values on the tSNE plot (right). I have edited out the identities of the clusters on the right. This is unpublished data, I am using it here for educational purposes only. Please do not reproduce or copy this image.