Tidy Data: A Recipe for Efficient Data Analysis

Data

On the importance of tidy data for efficient analysis using the analogy of a well-organized kitchen

Author

Christoph Scheuch

Published

November 24, 2023

Imagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal.

Tidy data are like well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together, e.g., spices or dairies. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside, e.g., pepper or milk. In the same way, tidy data organizes information into a clear and consistent format, where each type of observational unit forms a table, each variable is in a column, and each observation is in a row (Wickham 2014).

Tidying data is about structuring datasets to facilitate analysis, visualization, report generation, or modelling. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.

Example for tidy data

To illustrate the concept of tidy data in our tidy kitchen, suppose we have a table called ingredient that contains information about all the ingredients that we currently have in our kitchen. It might look as follows:

name	quantity	unit	category
flour	500	grams	baking
sugar	200	grams	baking
butter	100	grams	dairy
eggs	4	units	dairy
milk	1	liters	dairy
salt	10	grams	seasoning
olive oil	0.2	liters	oil
tomatoes	300	grams	vegetable
chicken	400	grams	meat
rice	250	grams	grain

Each row refers to a specific ingredient and each column has a dedicated type and meaning. For instance, the column quantity contains information about how much of the ingredient called name we currently have and which unit we use to measure it.

Similarly, we could have a table just for dairy that might look as follows:

name	quantity	unit
milk	1	liters
butter	200	grams
yogurt	150	grams
cheese	100	grams
cream	0.5	liters
cottage cheese	250	grams
sour cream	150	grams
ghee	100	grams
whipping cream	0.3	liters
ice cream	500	grams

Notice that there is no category column in this table? It would actually be redundant to have this column because all rows in the `dairy`` table have the same category.

When colum headers are values, not variable names

Now let us move to data structures that are untidy. Consider the following variant of our dairy table:

type	liters	grams
milk	1
butter		200
yogurt		150
cheese		100
cream	0.5
cottage cheese		250
sour cream		150
ghee		100
whipping cream	0.3
ice cream		500

What is the issue here? Each row still refers to a specific dairy product. However, instead of dedicated quantity and unit columns, we have a liters and grams column. Since the units differ across dairy products, the table even contains missing values in the form of emtpy cells. So if you want to find out how much of ice cream you still have, you need to also check out the column name. In practice, we would create dedicated quantity and unit columns. we might even decide to have the same unit for all ingredients (e.g., measure everything in grams) and just keep a quantity column.

When multiple variables are stored in one column

Let us consider the following untidy version of our ingredient table.

type	quantity_and_unit
flour	500 grams
sugar	200 grams
butter	100 grams
eggs	4 units
milk	1 liter
salt	10 grams
olive oil	0.2 liters
tomatoes	300 grams
chicken	400 grams
rice	250 grams

This one is really annoying, since the quantity_and_unit column combines both the quantity and the unit of measurement into one string for each ingredient. Why is this an issue? This format actually makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. So in practice, we would actually start our data analysis by splitting out the quantity_and_unit column into quantity and unit.

When variables are stored in both rows and columns

Let us extend our kitchen analogy by additionally considering recipes. For simplicity, a recipe just denotes how much of each ingredient is required. The following table contains two variants of a recipe for pancakes:

ingredient	recipe1_quantity	recipe2_quantity
flour	500 grams	300 grams
sugar	200 grams	150 grams
butter	100 grams	50 grams
eggs	4 units	3 units
milk	1 liters	0.5 liters

The quantity for each ingredient for two different recipes is stored in separate columns. This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.

To convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity. We can then filer

When there are multiple types of data in the same column

A recipe typically contains information on the required utensils and how much time a step requires. Consider the following table with different types of data:

type	quantity	category
flour	500 grams	ingredient
butter	100 grams	ingredient
whisk	1 unit	utensil
sugar	200 grams	ingredient
baking time	30 minutes	time

The table is trying to describe a recipe but combines different types of data within the same columns. There are ingredients with their quantities, a utensil, and cooking time, all mixed together.

A tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.

When some data is missing

As a last example for untidy data, let us consider the original ingredient table again, but with a few empty cells.

name	quantity	unit
flour		grams
sugar	200	grams
butter	100	grams
eggs	4	units
milk	10
salt	10	grams
olive oil	0.2	liters
tomatoes	300	grams
chicken	400	grams
	250	grams

What is the issue here? There are actually a couple of them:

The flour row does have any information about quantity, so we just don’t know how much we have.
The milk row does not contain a unit, so we might have 10 liters, 10 milliliters, or 10 cups of milk.
The last row does not have any name, so we have 250 grams of something that we just can’t identify.

Why is this important? It makes a huge difference how me treat the missing information. For instance, we might make an educated guess for milk if we always record that information in litres, then the missing unit is very likely litres. For flour, we could play it safe and just say that the available quantity is zero. For the ingredient without a name, we might have to throw it away or ask somebody else to tell us what it is.

Overall, these examples highlight the most important issues that you might have to consider when preparing data for your analysis.

References

Wickham, Hadley. 2014. “Tidy data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.