Five Common Problems with Messy Data

Real-world datasets are often messy and poorly organized for the data analysis tools available. The data scientist’s job often begins with whipping these datasets into shape for analysis.

Listed below are five of the most common problems with messy datasets, according to an excellent paper on “tidy data” by Hadley Wickham:

1) Column headers are values, not variable names

Tables built for presentation often fall into this category: the column headers are themselves values of a variable. For example, a table with US states in rows and median income by percentile in columns stores the percentile, which is a variable, in the headers.
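One way to fix this is to "melt" the table so that each row holds one state, one percentile, and one income value. The following is a minimal sketch in plain Python, assuming a hypothetical table with percentile headers p25, p50, and p75; the helper name `melt`, the column names, and the income figures are illustrative placeholders, not real data.

```python
# Hypothetical wide table: the percentile labels are column headers.
# The income figures are illustrative placeholders, not real data.
wide = [
    {"state": "A", "p25": 30000, "p50": 50000, "p75": 80000},
    {"state": "B", "p25": 28000, "p50": 47000, "p75": 75000},
]


def melt(rows, id_col, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            tidy.append({id_col: row[id_col], var_name: col, value_name: value})
    return tidy


tidy_income = melt(wide, id_col="state", var_name="percentile", value_name="income")
for record in tidy_income:
    print(record)
```

Each state now contributes three rows, one per percentile, and the percentile is an ordinary column that can be filtered or grouped.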

2) Multiple variables are stored in one column

An example would be a column that combines two variables, such as gender and age range. It is better to store gender and age range in two separate columns.
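A small sketch of that split in plain Python, assuming a hypothetical encoding such as "m_25-34" in a column named `group`; the column names, the separator, and the counts are assumptions made for illustration.

```python
# Hypothetical rows in which one column encodes both gender and age range.
rows = [
    {"group": "m_25-34", "count": 12},
    {"group": "f_35-44", "count": 7},
]


def split_group(row, sep="_"):
    """Return a copy of the row with 'group' split into gender and age_range."""
    gender, age_range = row["group"].split(sep, 1)
    new_row = {key: value for key, value in row.items() if key != "group"}
    new_row.update({"gender": gender, "age_range": age_range})
    return new_row


split_rows = [split_group(row) for row in rows]
for record in split_rows:
    print(record)
```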

3) Variables are stored in both rows and columns

This is the most complex form of messy data. For example, a weather station dataset may store measurements by date and time, with the names of the various measurement types (temperature, pressure, etc.) in a column called “measurements”.
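For the weather station case, the usual remedy is to pivot the measurement names into their own columns so that each timestamp becomes a single observation. A minimal sketch, assuming hypothetical column names (`timestamp`, `measurement`, `value`) and made-up readings:

```python
from collections import defaultdict

# Hypothetical long-format readings: variable names live in a column.
readings = [
    {"timestamp": "2014-01-01 00:00", "measurement": "temp", "value": 3.5},
    {"timestamp": "2014-01-01 00:00", "measurement": "pressure", "value": 1012.0},
    {"timestamp": "2014-01-01 01:00", "measurement": "temp", "value": 3.1},
    {"timestamp": "2014-01-01 01:00", "measurement": "pressure", "value": 1011.5},
]


def pivot(rows, index, names_from, values_from):
    """Collect rows sharing an index value into one row, one column per name."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row[index]][row[names_from]] = row[values_from]
    return [{index: key, **columns} for key, columns in grouped.items()]


observations = pivot(readings, "timestamp", "measurement", "value")
for record in observations:
    print(record)
```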

4) Multiple types of observational units are stored in the same table

This is a dataset that combines multiple unrelated kinds of observations or facts in one table. For example, a clinical trial dataset may combine both treatment outcomes and diet choices in one large table by patient and date.
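Separating the observational units can be as simple as projecting the combined table onto two sets of columns that share the same keys. A sketch under assumed column names (`patient`, `date`, `outcome`, `diet`) with invented records:

```python
# Hypothetical combined table mixing treatment outcomes and diet choices.
combined = [
    {"patient": 1, "date": "2014-01-01", "outcome": "improved", "diet": "vegetarian"},
    {"patient": 2, "date": "2014-01-01", "outcome": "unchanged", "diet": "low-carb"},
]


def project(rows, columns):
    """Keep only the named columns from each row."""
    return [{column: row[column] for column in columns} for row in rows]


# Both tables keep the patient/date keys so they can be joined again later.
outcomes = project(combined, ["patient", "date", "outcome"])
diets = project(combined, ["patient", "date", "diet"])
print(outcomes)
print(diets)
```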

5) A single observational unit is stored in multiple tables

Measurements for one type of observation are recorded in different tables, split up by person, location, or time. For example, a separate table of an individual’s medical history for each year of their life.
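Here the fix is to stack the tables into one and record the splitting variable (the year, in this example) as an explicit column. A minimal sketch, assuming hypothetical per-year tables keyed by year; the field names and records are invented for illustration.

```python
# Hypothetical per-year tables for the same kind of observation.
history_by_year = {
    2013: [{"patient": 1, "visit": "flu shot"}],
    2012: [{"patient": 1, "visit": "checkup"}, {"patient": 1, "visit": "x-ray"}],
}


def stack(tables, key_name):
    """Concatenate the tables, storing each table's key in a new column."""
    return [
        {key_name: key, **row}
        for key, rows in sorted(tables.items())
        for row in rows
    ]


history = stack(history_by_year, key_name="year")
for record in history:
    print(record)
```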

Isotopic labeling of proteins in non-bacterial expression systems

As therapeutic proteins gain importance alongside traditional small molecule drugs, there is increasing interest in using NMR methods to examine their structure, dynamics, and stability/aggregation in solution.

Modern heteronuclear NMR of proteins relies on isotopically labeled samples containing NMR-active nuclei in the peptide backbone, sidechains, or both.

Although isotopic labeling of recombinant protein is typically carried out in E. coli expression systems, many biotherapeutic proteins must be expressed in eukaryotic systems to ensure proper folding and/or post-translational modifications. In practice, this means overexpression in yeast, insect, or mammalian cells.

Increased interest in attaining labeled protein samples for analysis by NMR is leading to better commercial availability of isotopically-labeled expression media and improved vectors for overexpression in non-bacterial systems.

Comprehensive reviews of state-of-the-art protocols and procedures for expression of isotopically-labeled proteins in non-standard systems are available here: yeast, insect cells, and mammalian cells.

Virtual screening capability for under $5K?

Many early stage companies may be missing out on the value that docking can provide at the validated hit and hit-to-lead stages of development, where structure/activity relationships (SAR) can help guide chemistry development of lead compounds.

While docking large HTS libraries with millions of compounds may require specialized CPU clusters, docking of small libraries (i.e., thousands of compounds) and SAR compounds from experimental assays is readily achievable in short time frames with a relatively inexpensive Intel Xeon workstation.
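To give a sense of how a small library might be batched on such a workstation, the sketch below builds one AutoDock Vina command line per ligand without running anything. The file names, the search box center and size, the output directory, and the assumption that the executable is called `vina` and is on the PATH are all hypothetical; the `cpu=6` default simply mirrors a 6-core machine.

```python
from pathlib import Path


def vina_command(receptor, ligand, out_dir, center, size, cpu=6, exhaustiveness=8):
    """Build (but do not run) a Vina command line for one ligand."""
    out_path = Path(out_dir) / f"{Path(ligand).stem}_docked.pdbqt"
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        "vina",  # assumed executable name on the PATH
        "--receptor", str(receptor),
        "--ligand", str(ligand),
        "--center_x", str(cx), "--center_y", str(cy), "--center_z", str(cz),
        "--size_x", str(sx), "--size_y", str(sy), "--size_z", str(sz),
        "--out", str(out_path),
        "--cpu", str(cpu),
        "--exhaustiveness", str(exhaustiveness),
    ]


# Hypothetical inputs: a prepared receptor and a few prepared ligands.
ligands = ["ligands/lig001.pdbqt", "ligands/lig002.pdbqt"]
commands = [
    vina_command("receptor.pdbqt", ligand, "docked",
                 center=(10.0, 12.5, -3.0), size=(20, 20, 20))
    for ligand in ligands
]
for command in commands:
    print(" ".join(command))
```

On a machine with Vina installed, each command list could be passed to `subprocess.run(command, check=True)`; the receptor and ligands would first need to be converted to PDBQT format, for example with AutoDock Tools.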

Following an initial investment in the workstation and software, follow-on costs are minimal (e.g., electricity, IT support, and data backup). Turnaround times may be faster than with CRO services. Sensitive IP data is also protected because it is retained onsite and not transmitted over the internet.

Equipment / cost breakdown:

Software:

AutoDock Vina (non-restrictive commercial license), cost: free

Accurate (benchmarked against 6 other commercial docking programs), compatible with AutoDock Tools, optimized for speed (orders of magnitude faster than the previous generation), and parallelized for multi-core systems.

AutoDock Tools (non-restrictive commercial license), cost: free

PyMOL Incentive (commercial license), cost: ~$90 / mo

Visualizes docking results; a free plugin allows Vina to be run within the PyMOL GUI.

Fedora Linux, cost: free

Hardware:

HP Z620 Workstation (stock configuration), cost: $2999

Intel Xeon E5-2620, 2 GHz, 6 cores

USB keyboard and mouse, cost: $50

Dell UltraSharp 27” LED monitor, cost: $649

1TB USB HD for data backup, cost: $150

IT support for initial setup (~4 hours), cost: $400

Total initial capital expenditure: ~$4350
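As a quick check of the arithmetic, the one-time items listed above can be summed directly. The sketch below assumes that the quoted total also covers roughly one month of the PyMOL subscription; that reading is an assumption, not something the list states.

```python
# One-time costs as listed above, in US dollars.
one_time_costs = {
    "HP Z620 workstation": 2999,
    "USB keyboard and mouse": 50,
    "Dell UltraSharp 27-inch monitor": 649,
    "1TB USB hard drive": 150,
    "IT support for initial setup": 400,
}
pymol_monthly = 90  # approximate monthly subscription cost

one_time_total = sum(one_time_costs.values())
# Assumption: the quoted total includes the first month of PyMOL.
first_month_total = one_time_total + pymol_monthly

print(f"One-time items: ${one_time_total}")
print(f"With first month of PyMOL: ${first_month_total}")
```

The one-time items come to $4248, and adding one month of PyMOL gives $4338, which is consistent with the rounded figure of about $4350 and leaves some margin under the $5K mark.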