library(dplyr)
library(palmerpenguins)
penguins <- palmerpenguins::penguins
Tidy Data Manipulation: dplyr vs ibis
There are a myriad of options to perform essential data manipulation tasks in R and Python (see, for instance, my other posts on dplyr vs pandas and dplyr vs polars). However, if we want to do tidy data science in R, there is a clear frontrunner: dplyr. In the world of Python, ibis has been around since 2015 but recently gained traction due to its appealing flexibility with respect to data backends. In this blog post, I illustrate their syntactic similarities and highlight differences between these two packages that emerge for a few key tasks.
Before we dive into the comparison, a short introduction to the packages: the dplyr package in R allows users to refer to columns without quotation marks due to its implementation of non-standard evaluation (NSE). NSE is a programming technique used in R that allows functions to capture the expressions passed to them as arguments, rather than just the values of those arguments. The primary goal of NSE in the context of dplyr is to create a more user-friendly and intuitive syntax. This makes data manipulation tasks more straightforward and aligns with the general philosophy of the tidyverse to make data science faster, easier, and more fun.[1]
ibis is a Python library that provides a lightweight and universal interface for data wrangling using many different data backends. The core idea behind ibis is to provide Python users with a familiar pandas-like syntax while allowing them to work with larger datasets that don’t fit into memory. As you will see below, the ibis syntax can be surprisingly closer to dplyr than to the original idea of resembling pandas.
In addition, ibis builds an expression tree as you write code. This tree is then translated into the native query language of the target data source, be it SQL or something else, and executed remotely (similar to the dbplyr package in R). This approach ensures that only the final results are loaded into Python, significantly reducing memory overhead.
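To make the idea of an expression tree more concrete, here is a minimal sketch in plain Python. It is a toy model and not how ibis is implemented internally: the classes Expr, Column, Literal, and BinaryOp, the helper wrap(), and the to_sql() method are all hypothetical names invented for this illustration. The point is only that writing the expression records operations, and the translation into a query string happens later.

```python
# Toy illustration only: this is NOT how ibis is implemented internally.
# All class and function names below are hypothetical.

class Expr:
    """Base node of a tiny expression tree; operators build new nodes."""

    def __gt__(self, other):
        return BinaryOp(">", self, wrap(other))

    def __truediv__(self, other):
        return BinaryOp("/", self, wrap(other))


class Column(Expr):
    """Leaf node: a reference to a column by name."""

    def __init__(self, name):
        self.name = name

    def to_sql(self):
        return self.name


class Literal(Expr):
    """Leaf node: a constant value such as 10."""

    def __init__(self, value):
        self.value = value

    def to_sql(self):
        return repr(self.value)


class BinaryOp(Expr):
    """Inner node: an operator applied to two sub-expressions."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def to_sql(self):
        return f"({self.left.to_sql()} {self.op} {self.right.to_sql()})"


def wrap(value):
    """Turn plain Python values into Literal nodes; leave Expr nodes alone."""
    return value if isinstance(value, Expr) else Literal(value)


# Writing the expression computes nothing; it only records the operations.
expr = Column("bill_length_mm") / 10 > 4

# Only at "execution" time is the tree translated into a query string.
sql = f"SELECT * FROM penguins WHERE {expr.to_sql()}"
print(sql)  # SELECT * FROM penguins WHERE ((bill_length_mm / 10) > 4)
```

In a real backend-agnostic library, the translation step would of course target different query languages depending on the backend; the sketch hard-codes a SQL-like string for brevity.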
Loading packages and data
We start by loading the main packages of interest and the popular palmerpenguins package that exists for both R and Python. We then use the penguins data frame as the data to compare all functions and methods below. Note that we also enable the interactive mode in ibis to limit the print output of ibis data frames to 10 rows.
Note that the ibis-framework package is not the same as the ibis package on PyPI. These two libraries cannot coexist in the same Python environment, as they are both imported with the ibis module name. So be careful to install the correct ibis-framework package via: pip install 'ibis-framework[duckdb]'
import ibis
import ibis.selectors as s
from ibis import _
from palmerpenguins import load_penguins
ibis.options.interactive = True

penguins = ibis.memtable(load_penguins(), name = "penguins")
Work with rows
Filter rows
Filtering rows works very similarly for both packages; they even have the same function names: dplyr::filter() and ibis.filter(). To refer to columns in ibis, you need the ibis._ selector. Note that you can pass a list to ibis.filter() in case you want to have multiple conditions, which are then combined with a logical AND.
penguins |>
  filter(species == "Adelie" &
           island %in% c("Biscoe", "Dream"))
# A tibble: 100 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 37.8 18.3 174 3400
2 Adelie Biscoe 37.7 18.7 180 3600
3 Adelie Biscoe 35.9 19.2 189 3800
4 Adelie Biscoe 38.2 18.1 185 3950
5 Adelie Biscoe 38.8 17.2 180 3800
6 Adelie Biscoe 35.3 18.9 187 3800
7 Adelie Biscoe 40.6 18.6 183 3550
8 Adelie Biscoe 40.5 17.9 187 3200
9 Adelie Biscoe 37.9 18.6 172 3150
10 Adelie Biscoe 40.5 18.9 180 3950
# ℹ 90 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .filter([
    _.species == "Adelie",
    _.island.isin(["Biscoe", "Dream"])
  ])
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string │ string │ float64 │ float64 │ float64 │ … │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼───┤
│ Adelie │ Biscoe │ 37.8 │ 18.3 │ 174.0 │ … │
│ Adelie │ Biscoe │ 37.7 │ 18.7 │ 180.0 │ … │
│ Adelie │ Biscoe │ 35.9 │ 19.2 │ 189.0 │ … │
│ Adelie │ Biscoe │ 38.2 │ 18.1 │ 185.0 │ … │
│ Adelie │ Biscoe │ 38.8 │ 17.2 │ 180.0 │ … │
│ Adelie │ Biscoe │ 35.3 │ 18.9 │ 187.0 │ … │
│ Adelie │ Biscoe │ 40.6 │ 18.6 │ 183.0 │ … │
│ Adelie │ Biscoe │ 40.5 │ 17.9 │ 187.0 │ … │
│ Adelie │ Biscoe │ 37.9 │ 18.6 │ 172.0 │ … │
│ Adelie │ Biscoe │ 40.5 │ 18.9 │ 180.0 │ … │
│ … │ … │ … │ … │ … │ … │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴───┘
Slice rows
dplyr::slice() takes integers with row numbers as inputs, so you can use ranges and arbitrary vectors of integers. ibis.limit() only takes the number of rows to slice and the number of rows to skip as inputs. For instance, to get the same result of slicing rows 10 to 20, the code looks as follows (note that indexing starts at 0 in Python, while it starts at 1 in R):
penguins |>
  slice(10:20)
# A tibble: 11 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 42 20.2 190 4250
2 Adelie Torgersen 37.8 17.1 186 3300
3 Adelie Torgersen 37.8 17.3 180 3700
4 Adelie Torgersen 41.1 17.6 182 3200
5 Adelie Torgersen 38.6 21.2 191 3800
6 Adelie Torgersen 34.6 21.1 198 4400
7 Adelie Torgersen 36.6 17.8 185 3700
8 Adelie Torgersen 38.7 19 195 3450
9 Adelie Torgersen 42.5 20.7 197 4500
10 Adelie Torgersen 34.4 18.4 184 3325
11 Adelie Torgersen 46 21.5 194 4200
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .limit(11, offset = 9)
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string │ string │ float64 │ float64 │ float64 │ … │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼───┤
│ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190.0 │ … │
│ Adelie │ Torgersen │ 37.8 │ 17.1 │ 186.0 │ … │
│ Adelie │ Torgersen │ 37.8 │ 17.3 │ 180.0 │ … │
│ Adelie │ Torgersen │ 41.1 │ 17.6 │ 182.0 │ … │
│ Adelie │ Torgersen │ 38.6 │ 21.2 │ 191.0 │ … │
│ Adelie │ Torgersen │ 34.6 │ 21.1 │ 198.0 │ … │
│ Adelie │ Torgersen │ 36.6 │ 17.8 │ 185.0 │ … │
│ Adelie │ Torgersen │ 38.7 │ 19.0 │ 195.0 │ … │
│ Adelie │ Torgersen │ 42.5 │ 20.7 │ 197.0 │ … │
│ Adelie │ Torgersen │ 34.4 │ 18.4 │ 184.0 │ … │
│ … │ … │ … │ … │ … │ … │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴───┘
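The arithmetic behind the limit and offset values above can be sketched in a few lines of plain Python. The helper slice_to_limit() is a hypothetical name introduced here for illustration; it is not part of ibis or dplyr. It maps a 1-based, inclusive row range, as used in dplyr::slice(), to the number of rows and the number of rows to skip.

```python
from __future__ import annotations


def slice_to_limit(first: int, last: int) -> tuple[int, int]:
    """Map a 1-based, inclusive row range (as in slice(first:last) in R)
    to an (n, offset) pair for a limit/offset style interface.

    Hypothetical helper for illustration; not part of ibis or dplyr.
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    n = last - first + 1   # both endpoints are included
    offset = first - 1     # rows to skip before the first selected row
    return n, offset


n, offset = slice_to_limit(10, 20)
print(n, offset)  # 11 9
```

The resulting pair corresponds to the call penguins.limit(11, offset = 9) used above.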
Arrange rows
To order the rows of a data frame by the values of selected columns, we have dplyr::arrange() and ibis.order_by(). Both approaches arrange rows in ascending order and put missing values last. Again, you pass a list to ibis.order_by() to sort by multiple columns.
penguins |>
  arrange(island, desc(bill_length_mm))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Gentoo Biscoe 55.9 17 228 5600
3 Gentoo Biscoe 55.1 16 230 5850
4 Gentoo Biscoe 54.3 15.7 231 5650
5 Gentoo Biscoe 53.4 15.8 219 5500
6 Gentoo Biscoe 52.5 15.6 221 5450
7 Gentoo Biscoe 52.2 17.1 228 5400
8 Gentoo Biscoe 52.1 17 230 5550
9 Gentoo Biscoe 51.5 16.3 230 5500
10 Gentoo Biscoe 51.3 14.2 218 5300
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .order_by([_.island, _.bill_length_mm.desc()])
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string │ string │ float64 │ float64 │ float64 │ … │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼───┤
│ Gentoo │ Biscoe │ 59.6 │ 17.0 │ 230.0 │ … │
│ Gentoo │ Biscoe │ 55.9 │ 17.0 │ 228.0 │ … │
│ Gentoo │ Biscoe │ 55.1 │ 16.0 │ 230.0 │ … │
│ Gentoo │ Biscoe │ 54.3 │ 15.7 │ 231.0 │ … │
│ Gentoo │ Biscoe │ 53.4 │ 15.8 │ 219.0 │ … │
│ Gentoo │ Biscoe │ 52.5 │ 15.6 │ 221.0 │ … │
│ Gentoo │ Biscoe │ 52.2 │ 17.1 │ 228.0 │ … │
│ Gentoo │ Biscoe │ 52.1 │ 17.0 │ 230.0 │ … │
│ Gentoo │ Biscoe │ 51.5 │ 16.3 │ 230.0 │ … │
│ Gentoo │ Biscoe │ 51.3 │ 14.2 │ 218.0 │ … │
│ … │ … │ … │ … │ … │ … │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴───┘
Work with columns
Select columns
Selecting a subset of columns works essentially the same for both packages, and dplyr::select() and ibis.select() even have the same name. Note that you don’t have to use ibis._ but can also just pass strings to the ibis.select() method.
penguins |>
  select(bill_length_mm, sex)
# A tibble: 344 × 2
bill_length_mm sex
<dbl> <fct>
1 39.1 male
2 39.5 female
3 40.3 female
4 NA <NA>
5 36.7 female
6 39.3 male
7 38.9 female
8 39.2 male
9 34.1 <NA>
10 42 <NA>
# ℹ 334 more rows
(penguins
  .select(_.bill_length_mm, _.sex)
)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ bill_length_mm ┃ sex ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ float64 │ string │
├────────────────┼────────┤
│ 39.1 │ male │
│ 39.5 │ female │
│ 40.3 │ female │
│ NULL │ NULL │
│ 36.7 │ female │
│ 39.3 │ male │
│ 38.9 │ female │
│ 39.2 │ male │
│ 34.1 │ NULL │
│ 42.0 │ NULL │
│ … │ … │
└────────────────┴────────┘
Rename columns
Renaming columns also works very similarly, with the major difference that ibis.rename() does not accept the column selector ibis._ on the right-hand side, while dplyr::rename() takes variable names via the usual NSE.
penguins |>
  rename(bill_length = bill_length_mm,
         bill_depth = bill_depth_mm)
# A tibble: 344 × 8
species island bill_length bill_depth flipper_length_mm body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 female
3 Adelie Torgersen 40.3 18 195 3250 female
4 Adelie Torgersen NA NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 193 3450 female
6 Adelie Torgersen 39.3 20.6 190 3650 male
7 Adelie Torgersen 38.9 17.8 181 3625 female
8 Adelie Torgersen 39.2 19.6 195 4675 male
9 Adelie Torgersen 34.1 18.1 193 3475 <NA>
10 Adelie Torgersen 42 20.2 190 4250 <NA>
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
(penguins
  .rename(bill_length = "bill_length_mm",
          bill_depth = "bill_depth_mm")
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island ┃ bill_length ┃ bill_depth ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string │ string │ float64 │ float64 │ float64 │ … │
├─────────┼───────────┼─────────────┼────────────┼───────────────────┼───┤
│ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181.0 │ … │
│ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186.0 │ … │
│ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195.0 │ … │
│ Adelie │ Torgersen │ NULL │ NULL │ NULL │ … │
│ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193.0 │ … │
│ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190.0 │ … │
│ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181.0 │ … │
│ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195.0 │ … │
│ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193.0 │ … │
│ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190.0 │ … │
│ … │ … │ … │ … │ … │ … │
└─────────┴───────────┴─────────────┴────────────┴───────────────────┴───┘
Mutate columns
Transforming existing columns or creating new ones is an essential part of data analysis. dplyr::mutate() and ibis.mutate() are the workhorses for these tasks. A big difference between dplyr::mutate() and ibis.mutate() is that in the latter you have to chain separate mutate calls together when you reference newly created columns, whereas in dplyr you can put them all in the same call.
penguins |>
  mutate(ones = 1,
         bill_length = bill_length_mm / 10,
         bill_length_squared = bill_length^2) |>
  select(ones, bill_length_mm, bill_length, bill_length_squared)
# A tibble: 344 × 4
ones bill_length_mm bill_length bill_length_squared
<dbl> <dbl> <dbl> <dbl>
1 1 39.1 3.91 15.3
2 1 39.5 3.95 15.6
3 1 40.3 4.03 16.2
4 1 NA NA NA
5 1 36.7 3.67 13.5
6 1 39.3 3.93 15.4
7 1 38.9 3.89 15.1
8 1 39.2 3.92 15.4
9 1 34.1 3.41 11.6
10 1 42 4.2 17.6
# ℹ 334 more rows
(penguins
  .mutate(ones = 1,
          bill_length = _.bill_length_mm / 10)
  .mutate(bill_length_squared = _.bill_length**2)
  .select(_.ones, _.bill_length_mm, _.bill_length, _.bill_length_squared)
)
┏━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ones ┃ bill_length_mm ┃ bill_length ┃ bill_length_squared ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int8 │ float64 │ float64 │ float64 │
├──────┼────────────────┼─────────────┼─────────────────────┤
│ 1 │ 39.1 │ 3.91 │ 15.2881 │
│ 1 │ 39.5 │ 3.95 │ 15.6025 │
│ 1 │ 40.3 │ 4.03 │ 16.2409 │
│ 1 │ NULL │ NULL │ NULL │
│ 1 │ 36.7 │ 3.67 │ 13.4689 │
│ 1 │ 39.3 │ 3.93 │ 15.4449 │
│ 1 │ 38.9 │ 3.89 │ 15.1321 │
│ 1 │ 39.2 │ 3.92 │ 15.3664 │
│ 1 │ 34.1 │ 3.41 │ 11.6281 │
│ 1 │ 42.0 │ 4.20 │ 17.6400 │
│ … │ … │ … │ … │
└──────┴────────────────┴─────────────┴─────────────────────┘
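One way to think about why the mutate calls have to be chained is that every expression in a single call is resolved against the input table only. The following plain-Python sketch mimics this behavior with a toy mutate() function operating on a dictionary of columns. It is an assumption-laden illustration of the idea, not the actual ibis implementation, and the function and variable names are hypothetical.

```python
# Toy model: a "table" is a dict that maps column names to lists of values.
# The mutate() function below is hypothetical and only mimics the behavior.

def mutate(table, **exprs):
    """Return a new table; every expression sees only the INPUT table."""
    new_cols = {name: fn(table) for name, fn in exprs.items()}
    return {**table, **new_cols}


table = {"bill_length_mm": [39.1, 39.5]}

# Referencing a column created in the same call fails, because
# "bill_length" does not exist in the input table yet.
try:
    mutate(
        table,
        bill_length=lambda t: [x / 10 for x in t["bill_length_mm"]],
        bill_length_squared=lambda t: [x**2 for x in t["bill_length"]],
    )
    same_call_ok = True
except KeyError:
    same_call_ok = False

# Chaining two calls works: the second call sees the output of the first.
step1 = mutate(table, bill_length=lambda t: [x / 10 for x in t["bill_length_mm"]])
step2 = mutate(step1, bill_length_squared=lambda t: [x**2 for x in t["bill_length"]])

print(same_call_ok)    # False
print(sorted(step2))   # ['bill_length', 'bill_length_mm', 'bill_length_squared']
```

In dplyr, by contrast, the arguments of a single mutate() call are evaluated one after another, so later arguments can refer to columns created by earlier ones.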
Relocate columns
dplyr::relocate() provides options to change the positions of columns in a data frame, using the same syntax as dplyr::select(). In addition, there are the options .after and .before to provide users with additional shortcuts.
The recommended way to relocate columns in ibis is to use the ibis.select() method, but there are no options as in dplyr::relocate(). In fact, the safest way to consistently get the correct order of columns is to explicitly specify them.
penguins |>
  relocate(c(species, bill_length_mm), .before = sex)
# A tibble: 344 × 8
island bill_depth_mm flipper_length_mm body_mass_g species bill_length_mm
<fct> <dbl> <int> <int> <fct> <dbl>
1 Torgersen 18.7 181 3750 Adelie 39.1
2 Torgersen 17.4 186 3800 Adelie 39.5
3 Torgersen 18 195 3250 Adelie 40.3
4 Torgersen NA NA NA Adelie NA
5 Torgersen 19.3 193 3450 Adelie 36.7
6 Torgersen 20.6 190 3650 Adelie 39.3
7 Torgersen 17.8 181 3625 Adelie 38.9
8 Torgersen 19.6 195 4675 Adelie 39.2
9 Torgersen 18.1 193 3475 Adelie 34.1
10 Torgersen 20.2 190 4250 Adelie 42
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .select(_.island, _.bill_depth_mm, _.flipper_length_mm, _.body_mass_g,
          _.species, _.bill_length_mm, _.sex, _.year)
)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━┓
┃ island ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ species ┃ … ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━┩
│ string │ float64 │ float64 │ float64 │ string │ … │
├───────────┼───────────────┼───────────────────┼─────────────┼─────────┼───┤
│ Torgersen │ 18.7 │ 181.0 │ 3750.0 │ Adelie │ … │
│ Torgersen │ 17.4 │ 186.0 │ 3800.0 │ Adelie │ … │
│ Torgersen │ 18.0 │ 195.0 │ 3250.0 │ Adelie │ … │
│ Torgersen │ NULL │ NULL │ NULL │ Adelie │ … │
│ Torgersen │ 19.3 │ 193.0 │ 3450.0 │ Adelie │ … │
│ Torgersen │ 20.6 │ 190.0 │ 3650.0 │ Adelie │ … │
│ Torgersen │ 17.8 │ 181.0 │ 3625.0 │ Adelie │ … │
│ Torgersen │ 19.6 │ 195.0 │ 4675.0 │ Adelie │ … │
│ Torgersen │ 18.1 │ 193.0 │ 3475.0 │ Adelie │ … │
│ Torgersen │ 20.2 │ 190.0 │ 4250.0 │ Adelie │ … │
│ … │ … │ … │ … │ … │ … │
└───────────┴───────────────┴───────────────────┴─────────────┴─────────┴───┘
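If typing out all column names by hand feels error-prone, the new order can also be computed first and then passed on. The sketch below uses a hypothetical helper relocate_before(), which is not part of ibis, to build the list of column names in plain Python. Since ibis.select() also accepts strings, the resulting list could then be unpacked into the select call.

```python
def relocate_before(columns, to_move, before):
    """Return `columns` reordered so that `to_move` sits right before `before`.

    Hypothetical helper for illustration; not part of ibis.
    """
    if before in to_move:
        raise ValueError("`before` must not be one of the moved columns")
    # Drop the moved columns first, then splice them back in at the anchor.
    remaining = [c for c in columns if c not in to_move]
    idx = remaining.index(before)
    return remaining[:idx] + list(to_move) + remaining[idx:]


# Column names of the penguins data, written out here so that the example
# runs without ibis installed.
columns = [
    "species", "island", "bill_length_mm", "bill_depth_mm",
    "flipper_length_mm", "body_mass_g", "sex", "year",
]

new_order = relocate_before(columns, ["species", "bill_length_mm"], before="sex")
print(new_order)
```

With an ibis table at hand, the usage would presumably look like penguins.select(*new_order), which keeps all columns, including year, in the same order as the dplyr::relocate() call above.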
Work with groups of rows
Simple summaries by group
Let’s suppose we want to compute summaries by groups such as means or medians. Both packages are very similar again: on the R side you have dplyr::group_by() and dplyr::summarize(), while on the Python side you have ibis.group_by() and ibis.aggregate().
Note that dplyr also automatically arranges the summarized results by the group, so to reproduce the results of dplyr, we need to add ibis.order_by() to the chain.
penguins |>
  group_by(island) |>
  summarize(bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
island bill_depth_mean
<fct> <dbl>
1 Biscoe 15.9
2 Dream 18.3
3 Torgersen 18.4
(penguins
  .group_by("island")
  .aggregate(bill_depth_mean = _.bill_depth_mm.mean())
  .order_by("island")
)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ island ┃ bill_depth_mean ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ string │ float64 │
├───────────┼─────────────────┤
│ Biscoe │ 15.874850 │
│ Dream │ 18.344355 │
│ Torgersen │ 18.429412 │
└───────────┴─────────────────┘
More complicated summaries by group
Typically, you want to create multiple different summaries by groups. dplyr provides a lot of flexibility to create new variables on the fly, as does ibis. For instance, we can pass expressions to the mean functions in order to create the share of female penguins per island in the summary statement.
penguins |>
  group_by(island) |>
  summarize(
    count = n(),
    bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE),
    flipper_length_median = median(flipper_length_mm, na.rm = TRUE),
    body_mass_sd = sd(body_mass_g, na.rm = TRUE),
    share_female = mean(sex == "female", na.rm = TRUE)
  )
# A tibble: 3 × 6
island count bill_depth_mean flipper_length_median body_mass_sd share_female
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Biscoe 168 15.9 214 783. 0.491
2 Dream 124 18.3 193 417. 0.496
3 Torgers… 52 18.4 191 445. 0.511
(penguins
  .group_by("island")
  .aggregate(
    count = _.count(),
    bill_depth_mean = _.bill_depth_mm.mean(),
    flipper_length_median = _.flipper_length_mm.median(),
    body_mass_sd = _.body_mass_g.std(),
    share_female = (_.sex == "female").mean()
  )
  .order_by("island")
)
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━┓
┃ island ┃ count ┃ bill_depth_mean ┃ flipper_length_medi… ┃ body_mass_sd ┃ ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━┩
│ string │ int64 │ float64 │ float64 │ float64 │ │
├───────────┼───────┼─────────────────┼──────────────────────┼──────────────┼──┤
│ Biscoe │ 168 │ 15.874850 │ 214.0 │ 782.855743 │ │
│ Dream │ 124 │ 18.344355 │ 193.0 │ 416.644112 │ │
│ Torgersen │ 52 │ 18.429412 │ 191.0 │ 445.107940 │ │
└───────────┴───────┴─────────────────┴──────────────────────┴──────────────┴──┘
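The share of female penguins relies on a small trick: the comparison yields true, false, or missing for each row, and the mean of the non-missing values is the share of true values. A minimal plain-Python sketch of this logic, using a hypothetical helper share_true() and a made-up four-element sample rather than the actual penguins data, looks as follows.

```python
def share_true(flags):
    """Mean of boolean flags, ignoring missing values (None).

    Hypothetical helper for illustration; returns None if nothing is left.
    """
    present = [f for f in flags if f is not None]
    if not present:
        return None
    return sum(present) / len(present)  # True counts as 1, False as 0


# Made-up sample, not the real data: one missing value among four rows.
sex = ["male", "female", None, "female"]

# The comparison propagates missing values instead of turning them into False.
is_female = [None if s is None else s == "female" for s in sex]

share = share_true(is_female)
print(share)
```

In this sample, two of the three non-missing entries are female, so the share is two thirds. This mirrors what na.rm = TRUE does in the mean() call of the R code above.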
Conclusion
This post highlights syntactic similarities and differences across R’s dplyr and Python’s ibis packages. Two key points emerge: (i) dplyr heavily relies on NSE to enable a syntax that refrains from using strings and column selectors, something that is not possible in Python; (ii) the syntax is remarkably similar across both packages. I want to close this post by emphasizing that both languages and packages have their own merits and I won’t strictly recommend one over the other. However, I definitely prefer the print output of dplyr to ibis because the latter is silent about additional columns of the underlying data. I’m a big fan of the concise data printing capabilities that are part of dplyr.
Footnotes
[1] See the unifying principles of the tidyverse: https://design.tidyverse.org/unifying.html.