Real-world datasets are often messy and poorly organized for the data analysis tools at hand. A data scientist’s job often begins with whipping these messy datasets into shape for analysis.
Listed below are five of the most common problems with messy datasets, according to an excellent paper on “tidy data” by Hadley Wickham:
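1) Column headers are values, not variable names
Many tables built for presentation fall into this type, where the column headers are values of a variable rather than variable names. For example, a table with US states in rows and one column of median income per income percentile: the percentile is itself a variable, but its values are spread across the headers.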
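One way to tidy a table like this is to "melt" the percentile columns into rows. The sketch below uses pandas, which is an assumption on my part since the paper itself is tool-agnostic; the column names (`p25`, `p50`, `p75`) and the dollar figures are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per state, one column per income percentile.
# The dollar figures are invented for illustration only.
wide = pd.DataFrame({
    "state": ["Ohio", "Utah"],
    "p25": [30000, 35000],
    "p50": [55000, 60000],
    "p75": [90000, 95000],
})

# Melt the percentile columns into (percentile, income) pairs, one per row.
tidy = wide.melt(id_vars="state", var_name="percentile", value_name="income")

# Turn header labels such as "p25" into the number 25.
tidy["percentile"] = tidy["percentile"].str.lstrip("p").astype(int)
```

After melting, each row holds one state, one percentile, and one income figure, so filtering or grouping by percentile no longer requires picking columns by name.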
2) Multiple variables are stored in one column
An example here would be a column whose values combine two variables, like gender and age range. It is better to split it into two separate columns, one for gender and one for age range.
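As a rough sketch of that split, again assuming pandas: the combined labels here use an underscore as a separator (`m_0-14`), which is a hypothetical encoding chosen for the example, and the counts are invented.

```python
import pandas as pd

# Hypothetical counts keyed by a combined "gender_age" label (values made up).
messy = pd.DataFrame({
    "gender_age": ["m_0-14", "f_0-14", "m_15-24", "f_15-24"],
    "cases": [12, 9, 30, 27],
})

# Split once on the underscore; expand=True returns the pieces as columns 0 and 1.
parts = messy["gender_age"].str.split("_", n=1, expand=True)

# Attach the two pieces as separate variables and drop the combined column.
tidy = messy.assign(gender=parts[0], age_range=parts[1]).drop(columns="gender_age")
```

Real datasets rarely use such a clean separator, so the splitting rule usually has to be adapted to whatever encoding the source chose.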
3) Variables are stored in both rows and columns
This is the most complex form of messy data. For example, a dataset in which measurements from a weather station are stored according to date and time, with the various measurement types (temp, pressure, etc.) in a column called “measurements”.
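Fixing this kind of table usually takes two steps: melt the columns that hold values, then spread the column that holds variable names. The pandas sketch below assumes one possible layout, with times of day as column headers and the measurement type in the “measurements” column; both the layout and the readings are hypothetical.

```python
import pandas as pd

# Hypothetical layout: times of day sit in the column headers, while
# variable names (temp, pressure) sit in the "measurements" column.
messy = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "measurements": ["temp", "pressure", "temp", "pressure"],
    "00:00": [1.5, 1012.0, 0.5, 1008.0],
    "12:00": [6.0, 1010.0, 4.5, 1009.0],
})

# Step 1: melt the time columns into rows.
melted = messy.melt(
    id_vars=["date", "measurements"], var_name="time", value_name="value"
)

# Step 2: spread the measurement types into their own columns.
tidy = (
    melted.pivot(index=["date", "time"], columns="measurements", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
```

The result has one row per date and time, with `temp` and `pressure` as ordinary columns.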
4) Multiple types of observational units are stored in the same table
A dataset that combines several unrelated kinds of observations or facts in one table. For example, a clinical trial dataset that puts both treatment outcomes and diet choices in one large table keyed by patient and date.
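A minimal sketch of separating such a table, assuming pandas and invented column names (`blood_pressure`, `meal`): each kind of observation gets its own table, and the shared keys allow them to be joined again when needed.

```python
import pandas as pd

# Hypothetical combined table: treatment outcomes and diet facts share rows.
combined = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"],
    "blood_pressure": [120, 118, 135, 131],
    "meal": ["salad", "pasta", "soup", "salad"],
})

# Both tables keep the identifying keys so they can be re-joined later.
keys = ["patient", "date"]
outcomes = combined[keys + ["blood_pressure"]].copy()
diet = combined[keys + ["meal"]].copy()

# Joining on the keys brings the two kinds of observation back together.
rejoined = outcomes.merge(diet, on=keys)
```

Keeping the tables separate means each one can be cleaned, validated, and extended on its own schedule.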
5) A single observational unit is stored in multiple tables
Measurements recorded in different tables split up by person, location, or time. For example, a separate table of an individual’s medical history for each year of their life.
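The usual remedy is to record which table each row came from and then stack the tables into one. Here is a small pandas sketch under that assumption; the per-year tables, their columns, and their values are all hypothetical.

```python
import pandas as pd

# Hypothetical per-year tables for one patient (values made up).
by_year = {
    2019: pd.DataFrame({"visit": ["checkup"], "weight_kg": [70.5]}),
    2020: pd.DataFrame({"visit": ["checkup", "flu shot"], "weight_kg": [71.0, 71.2]}),
}

# Tag each table with the year it came from, then stack them into one table.
history = pd.concat(
    [table.assign(year=year) for year, table in by_year.items()],
    ignore_index=True,
)
```

The added `year` column preserves the information that was previously carried only by the table a row lived in.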