The Python ecosystem for beginners, part 2

Welcome to Part 2 of my post on the scientific Python ecosystem (Part 1 is here).   I will describe a few more of the most common and useful libraries that make up the typical Python scientific computing stack.   This is not an exhaustive list by any means, and new libraries are being continually developed by the open source community.

Matplotlib – high quality 2D and 3D plotting

Matplotlib is a plotting library that aims to make it easy to produce publication quality plots.  In typical Python style, Matplotlib code can be very succinct and yet yield complete, high-quality plots.  The library can generate many types of 2D graphs: regular plots, histograms, scatterplots, pie charts, statistical plots, and contour plots, to name a few.

Matplotlib is organized in a hierarchical manner that allows the user to quickly and easily create plots using high-level commands, while simultaneously allowing power users to delve into the object-oriented programming layer to control minute details of individual plots, should they choose to do so.
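To make the two layers concrete, here is a minimal sketch that draws the same sine curve both ways. The data, labels, and file names are my own placeholders, and the "Agg" backend is selected only so the script runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
out_dir = tempfile.gettempdir()  # placeholder output location

# High-level interface: pyplot keeps track of the "current" figure for you.
plt.figure()
plt.plot(x, np.sin(x))
plt.title("High-level pyplot calls")
plt.savefig(os.path.join(out_dir, "sine_pyplot.png"))
plt.close()

# Object-oriented interface: hold explicit Figure and Axes objects
# and adjust individual details on them.
fig, ax = plt.subplots(figsize=(6, 3))
line, = ax.plot(x, np.sin(x), label="sin(x)")
line.set_linewidth(2.5)
ax.set_xlabel("x (radians)")
ax.set_ylabel("amplitude")
ax.legend(loc="upper right")
out_path = os.path.join(out_dir, "sine_oo.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

The first half is usually enough for a quick look at data; the second half is the style to reach for when a figure needs fine tuning.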

Traits – interactive class instances and GUI building

Traits is a powerful package that extends Python type attributes in interesting and useful ways.  For instance, Python objects such as classes can have attribute “traits” that allow for initialization (setting a default value), notification (telling another part of the program that a value has changed) and visualization (responding to GUI inputs).  Although it is possible to achieve this using Python properties, Traits removes much of the boilerplate code and streamlines the process.
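To give a feel for the boilerplate in question, here is a plain-Python sketch of a default value plus change notification built with properties. This is not the Traits API; the Thermostat class and its method names are hypothetical and exist only for illustration.

```python
class Thermostat:
    """Plain-Python attribute with a default value and change notification."""

    def __init__(self):
        self._setpoint = 20.0   # initialization: the default value
        self._listeners = []    # callbacks to run when the value changes

    def on_setpoint_changed(self, callback):
        """Register a callback taking (old_value, new_value)."""
        self._listeners.append(callback)

    @property
    def setpoint(self):
        return self._setpoint

    @setpoint.setter
    def setpoint(self, value):
        value = float(value)    # crude type validation
        old, self._setpoint = self._setpoint, value
        if value != old:
            # notification: tell interested code that the value changed
            for callback in self._listeners:
                callback(old, value)


changes = []
t = Thermostat()
t.on_setpoint_changed(lambda old, new: changes.append((old, new)))
t.setpoint = 22.5
print(changes)  # [(20.0, 22.5)]
```

Every attribute that needs this behavior repeats the same pattern by hand, which is the kind of repetition a declarative attribute library is meant to remove.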

Chaco – interactive 2D plotting

Chaco is a plotting application toolkit for building rich, interactive plots.   Chaco works with Traits to build object-oriented models of plots that can accept and react to inputs from the GUI.

Cython – speed up your code with C

The easiest way to think about Cython is to imagine it as a superset of the Python language.  That is, all of the normal Python language is there, along with additional syntax that lets code call back and forth to C/C++ libraries seamlessly.  In Cython, you can also add static type declarations to Python functions to get C-level speedups in computation.  Cython code is translated into C code, which is then compiled for execution.  Unlike weave, which allows inline C code but requires that the Python code be re-compiled for C during every execution, Cython code is compiled only once (unless there are changes later), meaning that an end user does not need to bother with recompiling to run the code as a standalone program.

Using Cython for numerical computation in Python, speedups of 2000X or more above the pure Python equivalent are not uncommon.

scikit-learn – interactive machine learning

scikit-learn is a machine-learning library for Python.  It is based on NumPy, SciPy and matplotlib.  There are many algorithms available for performing machine-learning tasks, falling into four main areas: classification, clustering, regression, and dimensionality reduction (e.g., principal component analysis).
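As a small illustration of two of those areas, the sketch below runs a dimensionality reduction (PCA) followed by a clustering step (K-means). The data are synthetic blobs that I generate on the spot; the sizes, centres, and random seeds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs in 5 dimensions (made-up data).
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Dimensionality reduction: project onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the projected points into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

print(X_2d.shape)           # (100, 2)
print(np.bincount(labels))  # number of points assigned to each cluster
```

Most estimators in the library follow this same fit/transform/predict pattern, so swapping in a different algorithm usually changes only a line or two.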

The Python ecosystem for beginners

When first starting to learn Python, I found the array of package names and libraries a bit bewildering and confusing.  In this post I will enumerate many of the most common and useful parts of the Python computing “ecosystem” and attempt to describe them very briefly.

My aim is to provide some clarity on the situation for new users, as I would have liked to have seen the “30,000 ft view” when I started learning not long ago.   So without further ado, part 1 of the python ecosystem overview:

Python – high-level, interpreted language

Python is an interpreted programming language (itself written in C) that allows you to write very clean and simple code in a fast and human-readable form as compared to lower-level compiled languages (C++, Fortran).

The language is simple, self-consistent and beautiful.  It is relatively easy to learn and is gaining in popularity every year.  The simplicity and ease of use do come with a price: code written in pure Python is generally slower to execute than compiled C/C++ code.

iPython – get code written faster

iPython is an enhanced, interactive Python shell designed to make code development faster.   According to Wes McKinney in his excellent book, “Python for Data Analysis,” iPython is designed to encourage “an execute-explore workflow instead of the typical edit-compile-run workflow of many other programming languages.”

iPython contains the Python interpreter and is ready to execute commands as you enter them.  It is where you run code snippets, examine the outputs, and make iterative improvements.  In that sense, it is kind of like the UNIX command line.  You don’t write full programs here; you do that in a text editor or IDE (integrated development environment).

It also contains useful features called “magic commands.”  These are commands that are unique to the iPython command line, and are not valid Python code (i.e., you cannot use these commands in stand-alone programs).  Magic commands provide productivity speedups in many useful ways, such as recalling command history, running parts of scripts, timing code, and debugging code interactively.
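Because magic commands only work at the interactive prompt, a stand-alone script needs an ordinary Python equivalent. As a sketch, the timing magic (%timeit) can be approximated with the standard-library timeit module; the two functions being timed are made-up examples.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


# At the interactive prompt you would type:  %timeit squares_loop(10000)
# In a script, timeit.timeit runs the callable `number` times and
# returns the total elapsed time in seconds.
loop_time = timeit.timeit(lambda: squares_loop(10_000), number=100)
comp_time = timeit.timeit(lambda: squares_comprehension(10_000), number=100)

print(f"loop: {loop_time:.4f} s, comprehension: {comp_time:.4f} s (100 runs each)")
```

The magic command is more convenient because it picks the number of repetitions for you, but the idea is the same.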

If you have iPython installed, you can invoke it from a regular Python shell; however, I find it easiest to use it within an IDE that supports iPython.

iPython Notebook – share code and ideas over the web

The iPython notebook is an interactive Python format that runs in a web browser or in an IDE.  The Notebook is a flexible and powerful document format that allows Python code, Markdown text, mathematical equations, and figures to be displayed together in a coherent, inline way.  An iPython notebook could be used to provide all of the steps in a data analysis project, for example, or to teach a programming concept.  It is very useful for sharing code and visualizing results in an interactive, portable document.

NumPy – implement very fast vectorized computation

NumPy (numerical Python) is a powerful and fast library for doing numerical computation in Python.  It is based on a data structure called an ndarray.  This lower-level structure is faster for computation than regular higher-level Python structures like lists and dictionaries, but it is less flexible and behaves in somewhat unintuitive ways.  Functions can be applied across a NumPy array all at once, without writing an explicit loop; this is known as vectorization.  NumPy contains a number of vectorized built-in functions known as “ufuncs” for doing transformations on ndarrays.
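A short sketch of the difference between looping over a list and applying a ufunc to an ndarray (the numbers are arbitrary):

```python
import numpy as np

values = [1.0, 4.0, 9.0, 16.0]

# Pure-Python approach: visit the list one element at a time.
roots_list = [v ** 0.5 for v in values]

# Vectorized approach: the ufunc np.sqrt is applied to the whole ndarray at once.
arr = np.array(values)
roots_arr = np.sqrt(arr)

# Arithmetic operators are vectorized too and return a new array.
scaled = arr * 2 + 1

print(roots_arr.tolist())  # [1.0, 2.0, 3.0, 4.0]
print(scaled.tolist())     # [3.0, 9.0, 19.0, 33.0]

# Every element of an ndarray shares one dtype, which is part of why
# it is less flexible than a list.
print(arr.dtype)           # float64
```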

pandas – powerful library for data analysis

pandas is a data analysis package for Python, conceived and built initially by Wes McKinney.  It was developed to allow Python users to access some of the powerful features of the R statistics language while staying in the Python ecosystem.  Prior to the development of pandas, data analysis had to be carried out using the NumPy ndarray structures, which are rather awkward for handling messy real-world data.

Pandas achieves its speed and efficiency because it is built on top of NumPy and therefore takes advantage of the low-level ndarray data structures.  However, pandas allows users to create higher-level structures called Series (1D), DataFrames (2D) and Panels (3D) that are more flexible and useful for regular data analysis than raw NumPy arrays, owing to their ability to contain mixed data types, headers, and indexes.

Pandas also contains many built-in methods for operations on Series, DataFrames, and Panels that allow users to quickly and easily do data aggregation, reductions, and “split-apply-combine” strategies.
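Here is a minimal sketch of split-apply-combine on a small DataFrame. The column names and values are invented for the example.

```python
import pandas as pd

# Made-up measurements: mixed types (strings and floats) with column headers.
df = pd.DataFrame({
    "sample": ["a", "a", "b", "b", "b"],
    "signal": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split the rows by "sample", apply the mean to each group,
# and combine the results into a single Series indexed by sample.
mean_by_sample = df.groupby("sample")["signal"].mean()

print(mean_by_sample.to_dict())  # {'a': 2.0, 'b': 4.0}
```

The same groupby object accepts other reductions (sum, count, max and so on), so one line of code covers a lot of routine aggregation.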

SciPy – collection of libraries for a variety of computing applications

SciPy is a collection of algorithms for scientific computing.  It is an open-source project under active development.  Some of the packages available include scipy.linalg for doing linear algebra, scipy.stats for statistics, scipy.cluster for clustering (K-means and others), scipy.fftpack for doing Fourier transform analysis, scipy.optimize for doing curve fitting and minimization, and scipy.signal for doing signal processing.  There are many more packages in SciPy; what you end up using will depend on your application and area of interest.
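As one example of what these packages look like in use, the sketch below fits an exponential decay with scipy.optimize.curve_fit. The model function, the "true" parameters, and the noise level are all made up for the demonstration.

```python
import numpy as np
from scipy.optimize import curve_fit


def decay(t, amplitude, rate):
    """Single-exponential decay model (an arbitrary choice for this example)."""
    return amplitude * np.exp(-rate * t)


# Synthetic data: a known decay plus a little Gaussian noise.
t = np.linspace(0, 5, 50)
rng = np.random.default_rng(0)
y = decay(t, 2.0, 1.3) + rng.normal(scale=0.01, size=t.size)

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(decay, t, y, p0=(1.0, 1.0))
amplitude_fit, rate_fit = params

print(f"amplitude ~ {amplitude_fit:.2f}, rate ~ {rate_fit:.2f}")
```

The fitted values should land close to the 2.0 and 1.3 used to generate the data; the diagonal of the covariance matrix gives a rough idea of the uncertainty on each.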

In part 2, I will describe even more libraries and packages that you will encounter as you learn scientific computing with Python.

Automate your Topspin NMR workflow

Here is a tip for scientists who need to batch process NMR data quickly and uniformly for analysis.  This approach could be a big time-saver in situations where you have a large series of 1D reference spectra collected by sample automation, for example, or in NMR screening applications, where dozens of STD-NMR experiments are being collected during an overnight run.

Hidden away in the Topspin “Processing” menu is a feature called “Serial Processing:”

Select this menu option and you will see the following dialogue:

Since this is the first time you are doing this operation, you need to select “find datasets” in order to first find the data to process.  In the future, you will have a “list” created for you by the program that you can reuse to reference datasets in combinations that you specify.

When you click “find datasets” you will see this dialogue:

Select the data directory to search from the “data directories” box at the bottom of the window.  (If your NMR data directory is not here, it is because you haven’t added it to the Topspin file browser in the main Topspin window.  Go do that first, and come back and try this operation again.)

Under the “name” field, enter the name of the specific dataset directory you wish to search, or leave it blank to search across many directories.  You can also match on experiment number (EXPNO) or process number (PROCNO).  The check boxes enforce exact matching.  You can select 1D or higher dimensional datasets for processing.  You can also match by date.

When you’ve made your selections, it will look like this:

In this search, I am selecting for all 1D data contained in the “Oct16-2014-p97” subdirectory of my NMR data repository at “/Users/sandro/UCSF/p97_hit2lead/nmr”.

Click “OK” and wait for the results.  Mine look like the following:

The program has found 24 datasets that match my criteria.  At this point, you want to select only those you wish to batch process.  I will select all files like this:

Now click “OK” and you are returned to this prompt:

Notice that the program has now created a list of datasets for batch processing for you, stored in the ‘/var/folders/’ temporary directory.  The list is a text-based list of the filenames you specified by your selection criteria.  You can edit it by hand or proceed to the next step.  To proceed, click “next.”  You will now see this dialogue:

This is where the useful, time-saving stuff happens.  This dialogue takes the list you defined and applies whatever custom command sequence you would like to apply to your data.  You define this sequence in the text box at the bottom.  As you can see, I have chosen to perform “lb 1; em; ft; pk.”  This sets the line broadening to 1 Hz, then applies exponential multiplication, Fourier transform, and phase correction.  You can also specify a path to a Python script for the Topspin API.
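If you are curious what “lb 1; em; ft” does to the numbers, here is a rough NumPy sketch on a synthetic signal. It is not Topspin code and does not use the Topspin API.  I am assuming that exponential multiplication scales the FID by exp(-pi * LB * t) with LB in Hz, and the spectral width, number of points, and peak position are invented.  Phase correction is left out because this synthetic signal is already in phase.

```python
import numpy as np

sw = 1000.0            # spectral width in Hz (assumed value)
n = 4096               # number of complex points (assumed value)
t = np.arange(n) / sw  # acquisition time axis in seconds

# Synthetic FID: one resonance at 100 Hz with a natural linewidth of 2 Hz.
freq, natural_lw = 100.0, 2.0
fid = np.exp(2j * np.pi * freq * t) * np.exp(-np.pi * natural_lw * t)

lb = 1.0                                # "lb 1": line broadening in Hz
fid_em = fid * np.exp(-np.pi * lb * t)  # "em": exponential multiplication

# "ft": Fourier transform, shifted so that zero frequency sits in the middle.
spectrum = np.fft.fftshift(np.fft.fft(fid_em))
hz = np.fft.fftshift(np.fft.fftfreq(n, d=1.0 / sw))

peak_hz = hz[np.argmax(spectrum.real)]
print(f"peak found near {peak_hz:.1f} Hz")
```

The exponential window suppresses the noisy tail of the FID at the cost of a slightly broader, lower peak, which is the usual trade-off when choosing the line broadening value.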

Once you have your desired processing commands, click “Execute” and go grab a coffee!  You just saved yourself many minutes of routine processing of NMR spectra.    Hope you find this tip useful and that it can save you some time in your day.