library(dplyr)
library(ggplot2)
library(palmerpenguins)
penguins <- na.omit(palmerpenguins::penguins)
Tidy Data Visualization: ggplot2 vs matplotlib
ggplot2
is based on Leland Wilkinson's Grammar of Graphics, a set of principles for creating consistent and effective statistical graphics, and was developed by Hadley Wickham. The package is a cornerstone of the R community and integrates seamlessly with other tidyverse
packages. One of the key strengths of ggplot2
is its use of a consistent syntax, making it relatively easy to learn and enabling users to create a wide range of graphics with a common set of functions. The package is also highly customizable, allowing detailed adjustments to almost every element of a plot.
matplotlib
is a widely used data visualization library in Python, renowned for its ability to produce high-quality graphs and charts. Rooted in an imperative programming style, matplotlib
provides detailed control over plot elements, making it possible to fine-tune the aesthetics and layout of graphs to a high degree. Its compatibility with a variety of output formats and integration with other data science libraries like numpy
and pandas
make it a cornerstone in the Python scientific computing stack.
The types of plots that I chose for the comparison heavily draw on the examples given in R for Data Science - an amazing resource if you want to get started with data visualization.
Loading packages and data
We start by loading the main packages of interest and the popular penguins
data that exists as packages for both languages. We then use the penguins
data frame as the data to compare all functions and methods below. Note that I drop all rows with missing values because I don’t want to get into related messages in this post.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins
penguins = load_penguins().dropna()
A full-blown example
Let's start with an advanced example that combines many different aesthetics at the same time: we plot two columns against each other, use color and shape aesthetics to differentiate species, include separate regression lines for each species, manually set nice labels, and use a theme. As you can see in this example already, ggplot2
and matplotlib
have a fundamentally different syntactic approach. While ggplot2
works with layers and easily allows the creation of regression lines for each species, you have to use a loop to get the same results with matplotlib
. We also can see the difference between the declarative and imperative programming styles.
ggplot(penguins,
aes(x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species)) +
geom_point(size = 2) +
geom_smooth(method = "lm", formula = "y ~ x") +
labs(x = "Bill length (mm)", y = "Bill width (mm)",
title = "Bill length vs. bill width",
subtitle = "Using geom_point and geom_smooth of the ggplot2 package",
color = "Species", shape = "Species") +
theme_minimal()
species_unique = sorted(penguins["species"].unique())
markers = ["o", "s", "^"]
colors = ["red", "green", "blue"]

for species, marker, color in zip(species_unique, markers, colors):
    species_data = penguins[penguins["species"] == species]
    plt.scatter(
        species_data["bill_length_mm"], species_data["bill_depth_mm"],
        s = 50, alpha = 0.7, label = species, marker = marker, color = color
    )
    X = species_data["bill_length_mm"]
    Y = species_data["bill_depth_mm"]
    m, b = np.polyfit(X, Y, 1)
    plt.plot(X, m*X + b, color = color)

plt.xlabel("Bill length (mm)")
plt.ylabel("Bill width (mm)")
plt.title("Bill length vs. bill width")
plt.legend(title = "Species")
plt.show()
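The regression step inside the loop is the part that ggplot2 hides behind geom_smooth(), so it may help to look at it in isolation. The following is a minimal sketch on synthetic data (the numbers and variable names are made up for illustration, not taken from the penguins data). Note that np.polyfit() only returns the fitted line, whereas geom_smooth() also draws a confidence band by default.

```python
import numpy as np

# Synthetic stand-in for one species: y = 0.2 * x + 10 plus small noise.
rng = np.random.default_rng(42)
x = rng.uniform(35, 55, size=200)
y = 0.2 * x + 10 + rng.normal(0, 0.1, size=200)

# Degree-1 fit: coefficients come back highest power first (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

# Evaluate the line on sorted x values so it is drawn from left to right.
x_line = np.sort(x)
y_line = slope * x_line + intercept
```

Passing x_line and y_line to plt.plot() would draw the fitted line for that group.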
Visualizing distributions
A categorical variable
Let’s break down the differences in smaller steps by focusing on simpler examples. If you have a categorical variable and want to see how often each category occurs in your data, then ggplot2::geom_bar()
and matplotlib.pyplot.bar()
are your friends. I manually specify the order and values in the matplotlib
figure to mimic the automatic behavior of ggplot2
.
ggplot(penguins,
aes(x = island)) +
geom_bar()
categorical_variable = penguins["island"].value_counts()
plt.bar(categorical_variable.index, categorical_variable.values)
plt.show()
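The order of the bars depends on how the counts are sorted. As a small sketch on a toy column (a hypothetical stand-in for the island column): value_counts() sorts by frequency, while chaining sort_index() sorts by label, which is closer to how ggplot2 arranges a character column or a factor with alphabetical levels.

```python
import pandas as pd

# Hypothetical toy column standing in for penguins["island"].
islands = pd.Series(["Dream", "Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"])

# Most frequent category first.
by_frequency = islands.value_counts()

# Sorted by category label instead.
alphabetical = islands.value_counts().sort_index()
```

Either series can be passed to plt.bar() via its index and values, exactly as in the example above.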
A numerical variable
If you have a numerical variable, usually histograms are a good starting point to get a better feeling for the distribution of your data. ggplot2::geom_histogram()
and matplotlib.pyplot.hist()
with options to control bin widths or number of bins are the functions for this task. Note that you have to manually compute the bin edges from the range of values for matplotlib
.
ggplot(penguins,
aes(x = bill_length_mm)) +
geom_histogram(binwidth = 2)
numerical_variable = penguins["bill_length_mm"].dropna()
plt.hist(numerical_variable,
         bins = range(int(numerical_variable.min()),
                      int(numerical_variable.max()) + 2, 2))
plt.show()
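The manual bin computation can be wrapped in a small helper. This is only a sketch: bin_edges() is a hypothetical function name, the values are made up, and the resulting edges are not guaranteed to line up with the bins that geom_histogram(binwidth = 2) picks.

```python
import numpy as np

def bin_edges(values, binwidth):
    """Return equally spaced bin edges that cover all values (hypothetical helper)."""
    start = np.floor(np.min(values))
    stop = np.ceil(np.max(values))
    # Add one extra width so the last edge lies at or beyond the maximum.
    return np.arange(start, stop + binwidth, binwidth)

values = np.array([32.1, 40.5, 45.2, 59.6])
edges = bin_edges(values, binwidth=2)

# np.histogram uses the same edge convention as plt.hist(values, bins=edges).
counts, _ = np.histogram(values, bins=edges)
```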
Both packages also support density curves, but I personally wouldn't recommend starting with densities because they are estimated curves that might obscure underlying data features. However, we look at densities in the next section.
Visualizing relationships
A numerical and a categorical variable
To visualize relationships, you need to have at least two columns. If you have a numerical and a categorical variable, then histograms or densities with groups are a good starting point. The next example illustrates the use of density curves via ggplot2::geom_density()
. In matplotlib
, we have to manually estimate the densities and plot the corresponding lines in a loop. Visually, we get quite similar results.
ggplot(penguins,
aes(x = body_mass_g, color = species, fill = species)) +
geom_density(linewidth = 0.75, alpha = 0.5)
from scipy.stats import gaussian_kde

species_list = penguins["species"].unique()

for species in species_list:
    species_data = penguins[penguins["species"] == species]["body_mass_g"]
    density = gaussian_kde(species_data)
    xs = np.linspace(species_data.min(), species_data.max(), 200)
    density.covariance_factor = lambda : .25
    density._compute_covariance()
    plt.plot(xs, density(xs), lw = 0.75, label = species)
    plt.fill_between(xs, density(xs), alpha = 0.5)

plt.xlabel("body_mass_g")
plt.ylabel("density")
plt.legend()
plt.show()
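The bandwidth factor in the density example can also be set through the public bw_method argument of gaussian_kde() rather than by overriding covariance_factor. A minimal sketch on synthetic data (the sample is randomly generated, not the penguins column):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic body masses in grams, for illustration only.
rng = np.random.default_rng(0)
sample = rng.normal(loc=4200, scale=800, size=300)

# A scalar bw_method is used directly as the bandwidth factor.
kde = gaussian_kde(sample, bw_method=0.25)

# Evaluate the estimated density on a grid spanning the observed range.
xs = np.linspace(sample.min(), sample.max(), 200)
density = kde(xs)
```

A smaller factor gives a wigglier curve and a larger one a smoother curve, so 0.25 here is a choice to tune rather than a recommended default.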
Two categorical columns
Stacked bar plots are a good way to display the relationship between two categorical columns. geom_bar()
with the position
argument and matplotlib.axes.Axes.bar()
are your functions of choice for this task. For matplotlib
, we have to first compute the shares, then draw the bars for each island on top of the previous ones. Note that you can easily switch to counts by using position = "stack"
in ggplot2
instead of relative frequencies as in the example below.
ggplot(penguins, aes(x = species, fill = island)) +
geom_bar(position = "fill")
shares = (penguins
  .pipe(lambda x: pd.crosstab(x["species"], x["island"]))
  .pipe(lambda x: x.div(x.sum(axis = 1), axis = 0))
)

fig, ax = plt.subplots()
bottom = None
for island in shares.columns:
    ax.bar(shares.index, shares[island], bottom = bottom, label = island)
    bottom = shares[island] if bottom is None else bottom + shares[island]

plt.show()
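The two computations behind the stacked bars, row shares and the running bottom of each segment, can be written compactly. The sketch below uses a small made-up data frame and relies on the normalize argument of pd.crosstab(); treat it as one possible alternative to the explicit division and the loop variable, not as the only way to do it.

```python
import pandas as pd

# Made-up observations for illustration.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Biscoe", "Dream", "Dream", "Biscoe", "Biscoe", "Dream"],
})

# normalize="index" divides every row by its row total.
shares = pd.crosstab(toy["species"], toy["island"], normalize="index")

# Each segment starts where the previous ones end: cumulative sum minus itself.
bottoms = shares.cumsum(axis=1) - shares
```

In a plotting loop, shares[island] would be the bar heights and bottoms[island] the value for the bottom argument.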
Two numerical columns
Scatter plots and regression lines are definitely the most common approach for visualizing the relationship between two numerical columns and we focus on scatter plots for this example (see the first visualization example if you want to see again how to add a regression line). Here, the size
parameter controls the size of the shapes that you use for the data points in ggplot2::geom_point()
relative to the base size (i.e., it is not tied to any unit of measurement like pixels). For matplotlib.pyplot.scatter()
you have the s
parameter to control point sizes manually, where size is typically given in squared points (where a point is a unit of measure in typography, equal to 1/72 of an inch).
ggplot(penguins,
aes(x = bill_length_mm, y = flipper_length_mm)) +
geom_point(size = 2)
plt.scatter(x = penguins["bill_length_mm"], y = penguins["flipper_length_mm"],
            s = 50)
plt.xlabel("bill_length_mm")
plt.ylabel("flipper_length_mm")
plt.show()
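Because s is an area-like quantity in points squared, it can be easier to think in terms of the marker width and convert. The helpers below are hypothetical names for a quick back-of-the-envelope conversion, assuming the nominal marker width in points is the square root of s.

```python
import math

POINTS_PER_INCH = 72  # typographic points per inch

def diameter_to_s(diameter_pt):
    """Convert a nominal marker width in points to scatter's s (points squared)."""
    return diameter_pt ** 2

def s_to_diameter(s):
    """Convert scatter's s (points squared) back to a nominal width in points."""
    return math.sqrt(s)

# s = 50, as in the example above, is a marker roughly 7 points wide.
width_in_points = s_to_diameter(50)
width_in_inches = width_in_points / POINTS_PER_INCH
```

Doubling the marker width therefore means multiplying s by four, not by two.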
Three or more columns
You can include more information by mapping columns to additional aesthetics. For instance, we can map colors and shapes to species and create separate plots for each island by using facets. Facets are actually a great way to extend your figures, so I highly recommend playing around with them using your own data.
In ggplot2
you add the facet layer at the end, whereas in matplotlib
you have to start with the facet grid at the beginning and map scatter plots across facets. Note that I use variable assignment to penguins_facet
in order to prevent matplotlib
from printing the figure twice while rendering this post (no idea why though).
ggplot(penguins,
aes(x = bill_length_mm, y = flipper_length_mm)) +
geom_point(aes(color = species, shape = species)) +
facet_wrap(~island)
species = sorted(penguins["species"].unique())
islands = sorted(penguins["island"].unique())

color_map = dict(zip(species, ["blue", "green", "red"]))
shape_map = dict(zip(species, ["o", "^", "s"]))

fig, axes = plt.subplots(ncols = len(islands))

for ax, island in zip(axes, islands):
    island_data = penguins[penguins["island"] == island]
    for spec in species:
        spec_data = island_data[island_data["species"] == spec]
        ax.scatter(spec_data["bill_length_mm"], spec_data["flipper_length_mm"],
                   color = color_map[spec], marker = shape_map[spec], label = spec)
    ax.set_title(island)
    ax.set_xlabel("Bill Length (mm)")
    ax.set_ylabel("Flipper Length (mm)")

axes[0].legend(title = "Species")
plt.tight_layout()
plt.show()
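One difference worth knowing: facet_wrap() uses the same axis ranges in every panel by default, while plt.subplots() scales each panel independently unless told otherwise. A minimal sketch with synthetic data and hypothetical panel names shows the sharex and sharey arguments that request common scales:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
panels = {  # hypothetical panel names with synthetic (x, y) data
    "A": rng.normal(40, 3, size=(30, 2)),
    "B": rng.normal(45, 3, size=(30, 2)),
    "C": rng.normal(50, 3, size=(30, 2)),
}

# sharex/sharey link the axis limits across all panels.
fig, axes = plt.subplots(ncols=len(panels), sharex=True, sharey=True, figsize=(9, 3))
for ax, (name, xy) in zip(axes, panels.items()):
    ax.scatter(xy[:, 0], xy[:, 1], s=15)
    ax.set_title(name)
fig.tight_layout()
```

Calling plt.show() afterwards displays the figure. With shared axes, differences between panels are easier to compare at a glance.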
Time series
As a last example, we quickly dive into time series plots where you typically want to show multiple lines over some time period. Here, I aggregate the number of penguins by year and island and plot the corresponding lines. Both packages behave as expected and show similar output.
matplotlib
does not directly support setting line styles based on a variable like island within its syntax. Instead, you have to iterate over each island and manually set the line style for each island.
penguins |>
  count(year, island) |>
ggplot(aes(x = year, y = n, color = island)) +
geom_line(aes(linetype = island))
count_data = (penguins
  .groupby(["year", "island"])
  .size()
  .reset_index(name = "count")
)

islands = count_data["island"].unique()
colors = ["red", "green", "blue"]
line_styles = ["-", "--", "-."]

plt.figure()
for j, island in enumerate(islands):
    island_data = count_data[count_data["island"] == island]
    plt.plot(island_data["year"], island_data["count"],
             color = colors[j],
             linestyle = line_styles[j],
             label = island)

plt.xlabel("year")
plt.ylabel("count")
plt.legend(title = "island")
plt.show()
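Indexing parallel lists by position works as long as there are exactly as many styles as groups. A slightly more defensive sketch builds the group-to-style mapping up front; style_map() is a hypothetical helper, and the default styles are just the four basic matplotlib line styles.

```python
from itertools import cycle

def style_map(groups, line_styles=("-", "--", "-.", ":")):
    """Assign one line style per group, reusing styles if groups outnumber them."""
    return dict(zip(groups, cycle(line_styles)))

styles = style_map(["Biscoe", "Dream", "Torgersen"])
```

Inside the plotting loop, linestyle = styles[island] would then replace the positional lookup.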
Saving plots
As a final comparison, let us look at saving plots. ggplot2::ggsave()
provides the most important options as function parameters. In matplotlib
, you have to tweak the figure size before you can save the figure.
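To make the relationship between figure size and resolution concrete: the figure size is given in inches, and the dpi passed to savefig() determines how many pixels each inch becomes. The sketch below uses placeholder data and writes to an in-memory buffer instead of a file, purely so that it leaves nothing on disk.

```python
import io
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 5))  # width and height in inches
ax.scatter([1, 2, 3], [3, 1, 2])        # placeholder data

# Save into an in-memory buffer rather than a path on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=300)
buffer.seek(0)

# Read the PNG back to inspect its pixel dimensions.
image = plt.imread(buffer, format="png")
height_px, width_px = image.shape[:2]
plt.close(fig)
```

With a 7 by 5 inch figure at 300 dpi, the saved image should come out at 2100 by 1500 pixels.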
penguins_figure <- penguins |>
  ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) +
geom_point()
ggsave(penguins_figure, filename = "penguins-figure.png",
width = 7, height = 5, dpi = 300)
plt.figure(figsize = (7, 5))
plt.scatter(x = penguins["bill_length_mm"], y = penguins["flipper_length_mm"])
plt.xlabel("bill_length_mm")
plt.ylabel("flipper_length_mm")
plt.savefig("penguins-figure.png", dpi = 300)
Conclusion
In terms of syntax, ggplot2
and matplotlib
are considerably different. ggplot2
uses a declarative style where you declare what the plot should contain. You specify mappings between data and aesthetics (like color, size) and add layers to build up the plot. This makes it quite structured and consistent. matplotlib
, on the other hand, follows an imperative style where you build plots step by step. Each element of the plot (like lines, labels, legend) is added and customized using separate commands. It allows for a high degree of customization but can be verbose for complex plots.
I think both approaches are powerful and have their unique advantages, and the choice between them often depends on your programming language preference and specific requirements of the data visualization task at hand.