Tidy Data Manipulation: dplyr vs ibis

R
Python
Manipulation
A comparison of R’s dplyr and Python’s ibis data manipulation packages
Author

Christoph Scheuch

Published

January 4, 2024

There is a myriad of options for essential data manipulation tasks in R and Python (see, for instance, my other posts on dplyr vs pandas and dplyr vs polars). However, if we want to do tidy data science in R, there is a clear frontrunner: dplyr. In the world of Python, ibis has been around since 2015 but has recently gained traction thanks to its flexibility with respect to data backends. In this blog post, I illustrate the syntactic similarities between these two packages and highlight the differences that emerge for a few key tasks.

Before we dive into the comparison, a short introduction to the packages: the dplyr package in R allows users to refer to columns without quotation marks due to its implementation of non-standard evaluation (NSE). NSE is a programming technique used in R that allows functions to capture the expressions passed to them as arguments, rather than just the values of those arguments. The primary goal of NSE in the context of dplyr is to create a more user-friendly and intuitive syntax. This makes data manipulation tasks more straightforward and aligns with the general philosophy of the tidyverse to make data science faster, easier, and more fun.[1]

ibis is a Python library that provides a lightweight and universal interface for data wrangling using many different data backends. The core idea behind ibis is to provide Python users with a familiar pandas-like syntax while allowing them to work with larger datasets that don’t fit into memory. As you will see below, the ibis syntax is often surprisingly closer to dplyr than to pandas, which it originally set out to resemble. In addition, ibis builds an expression tree as you write code. This tree is then translated into the native query language of the target data source, be it SQL or something else, and executed remotely (similar to the dbplyr package in R). This approach ensures that only the final results are loaded into Python, significantly reducing memory overhead.
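
To make the idea of an expression tree a bit more tangible, here is a toy sketch in plain Python. It is not how ibis is implemented: the classes Expr, Column, Literal, and BinaryOp as well as the function compile_filter() are made up for illustration. The sketch only shows that operators can build a tree instead of computing values, and that the tree can later be translated into a SQL string.

```python
class Expr:
    """Base node of a toy expression tree: operators build nodes, not values."""

    def __eq__(self, other):
        return BinaryOp("=", self, wrap(other))

    def __gt__(self, other):
        return BinaryOp(">", self, wrap(other))

    def __and__(self, other):
        return BinaryOp("AND", self, wrap(other))


class Column(Expr):
    """Leaf node: a reference to a column by name."""

    def __init__(self, name):
        self.name = name

    def to_sql(self):
        return self.name


class Literal(Expr):
    """Leaf node: a constant value."""

    def __init__(self, value):
        self.value = value

    def to_sql(self):
        # Quote strings, leave numbers as they are.
        return f"'{self.value}'" if isinstance(self.value, str) else str(self.value)


class BinaryOp(Expr):
    """Inner node: an operator applied to two sub-expressions."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def to_sql(self):
        return f"({self.left.to_sql()} {self.op} {self.right.to_sql()})"


def wrap(value):
    """Turn plain Python values into Literal nodes."""
    return value if isinstance(value, Expr) else Literal(value)


def compile_filter(table, predicate):
    """Translate the tree into a SQL string; nothing is executed here."""
    return f"SELECT * FROM {table} WHERE {predicate.to_sql()}"


# Writing the condition only builds a tree of nodes.
predicate = (Column("species") == "Adelie") & (Column("body_mass_g") > 4000)
query = compile_filter("penguins", predicate)
print(query)
```

The query string is only produced at the very end, which mirrors the idea that the actual work is handed over to the backend.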

Loading packages and data

We start by loading the main packages of interest and the popular palmerpenguins package that exists for both R and Python. We then use the penguins data frame as the data to compare all functions and methods below. Note that we also enable the interactive mode in ibis, so that expressions are executed when they are printed and the print output is limited to the first 10 rows.

ibis-framework vs ibis

Note that the ibis-framework package is not the same as the ibis package in PyPI. These two libraries cannot coexist in the same Python environment, as they are both imported with the ibis module name. So be careful to install the correct ibis-framework package via: pip install 'ibis-framework[duckdb]'

library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins
import ibis
import ibis.selectors as s
from ibis import _
from palmerpenguins import load_penguins

ibis.options.interactive = True

penguins = ibis.memtable(load_penguins(), name = "penguins")

Work with rows

Filter rows

Filtering rows works very similarly in both packages, and the functions even share a name: dplyr::filter() and ibis.filter(). To refer to columns in ibis, you need the ibis._ selector. Note that the example below passes a list to ibis.filter() to combine multiple conditions.

penguins |> 
  filter(species == "Adelie" & 
           island %in% c("Biscoe", "Dream"))
# A tibble: 100 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           37.8          18.3               174        3400
 2 Adelie  Biscoe           37.7          18.7               180        3600
 3 Adelie  Biscoe           35.9          19.2               189        3800
 4 Adelie  Biscoe           38.2          18.1               185        3950
 5 Adelie  Biscoe           38.8          17.2               180        3800
 6 Adelie  Biscoe           35.3          18.9               187        3800
 7 Adelie  Biscoe           40.6          18.6               183        3550
 8 Adelie  Biscoe           40.5          17.9               187        3200
 9 Adelie  Biscoe           37.9          18.6               172        3150
10 Adelie  Biscoe           40.5          18.9               180        3950
# ℹ 90 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .filter([
    _.species == "Adelie", 
    _.island.isin(["Biscoe", "Dream"])
  ]) 
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string  │ string │ float64        │ float64       │ float64           │ … │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼───┤
│ Adelie  │ Biscoe │           37.8 │          18.3 │             174.0 │ … │
│ Adelie  │ Biscoe │           37.7 │          18.7 │             180.0 │ … │
│ Adelie  │ Biscoe │           35.9 │          19.2 │             189.0 │ … │
│ Adelie  │ Biscoe │           38.2 │          18.1 │             185.0 │ … │
│ Adelie  │ Biscoe │           38.8 │          17.2 │             180.0 │ … │
│ Adelie  │ Biscoe │           35.3 │          18.9 │             187.0 │ … │
│ Adelie  │ Biscoe │           40.6 │          18.6 │             183.0 │ … │
│ Adelie  │ Biscoe │           40.5 │          17.9 │             187.0 │ … │
│ Adelie  │ Biscoe │           37.9 │          18.6 │             172.0 │ … │
│ Adelie  │ Biscoe │           40.5 │          18.9 │             180.0 │ … │
│ …       │ …      │              … │             … │                 … │ … │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴───┘
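
The two conditions in the filter examples above are combined with a logical AND. A minimal sketch of that logic in plain Python, with a handful of made-up rows and lambda functions standing in for the ibis predicates:

```python
rows = [
    {"species": "Adelie", "island": "Biscoe"},
    {"species": "Adelie", "island": "Torgersen"},
    {"species": "Gentoo", "island": "Biscoe"},
    {"species": "Adelie", "island": "Dream"},
]

# Each condition is a function of one row, standing in for a predicate.
conditions = [
    lambda row: row["species"] == "Adelie",
    lambda row: row["island"] in {"Biscoe", "Dream"},
]

# A row is kept only if every condition in the list holds (logical AND).
kept = [row for row in rows if all(cond(row) for cond in conditions)]
print(kept)
```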

Slice rows

dplyr::slice() takes integers with row numbers as inputs, so you can use ranges and arbitrary vectors of integers. ibis.limit() only takes the number of rows to keep and the number of rows to skip as inputs. For instance, to get the same result of slicing rows 10 to 20, the code looks as follows (note that indexing starts at 0 in Python, while it starts at 1 in R):

penguins |> 
  slice(10:20)
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           42            20.2               190        4250
 2 Adelie  Torgersen           37.8          17.1               186        3300
 3 Adelie  Torgersen           37.8          17.3               180        3700
 4 Adelie  Torgersen           41.1          17.6               182        3200
 5 Adelie  Torgersen           38.6          21.2               191        3800
 6 Adelie  Torgersen           34.6          21.1               198        4400
 7 Adelie  Torgersen           36.6          17.8               185        3700
 8 Adelie  Torgersen           38.7          19                 195        3450
 9 Adelie  Torgersen           42.5          20.7               197        4500
10 Adelie  Torgersen           34.4          18.4               184        3325
11 Adelie  Torgersen           46            21.5               194        4200
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .limit(11, offset = 9) 
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string  │ string    │ float64        │ float64       │ float64           │ … │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼───┤
│ Adelie  │ Torgersen │           42.0 │          20.2 │             190.0 │ … │
│ Adelie  │ Torgersen │           37.8 │          17.1 │             186.0 │ … │
│ Adelie  │ Torgersen │           37.8 │          17.3 │             180.0 │ … │
│ Adelie  │ Torgersen │           41.1 │          17.6 │             182.0 │ … │
│ Adelie  │ Torgersen │           38.6 │          21.2 │             191.0 │ … │
│ Adelie  │ Torgersen │           34.6 │          21.1 │             198.0 │ … │
│ Adelie  │ Torgersen │           36.6 │          17.8 │             185.0 │ … │
│ Adelie  │ Torgersen │           38.7 │          19.0 │             195.0 │ … │
│ Adelie  │ Torgersen │           42.5 │          20.7 │             197.0 │ … │
│ Adelie  │ Torgersen │           34.4 │          18.4 │             184.0 │ … │
│ …       │ …         │              … │             … │                 … │ … │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴───┘
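
The translation from slice(10:20) to limit(11, offset = 9) is simple arithmetic, but off-by-one errors are easy to make. Here is a small helper in plain Python that spells it out; the function name slice_to_limit() is my own and not part of either package.

```python
def slice_to_limit(first, last):
    """Convert a 1-based, inclusive row range (as in R's first:last)
    into the (n, offset) pair that a limit/offset call expects."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    n = last - first + 1  # the range is inclusive, so add one
    offset = first - 1    # number of rows to skip before the first kept row
    return n, offset


n, offset = slice_to_limit(10, 20)
print(n, offset)  # 11 9

# Check against a stand-in for the 344 row numbers of the penguins data.
row_numbers = list(range(1, 345))
selected = row_numbers[offset:offset + n]
print(selected[0], selected[-1])  # 10 20
```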

Arrange rows

To order the rows of a data frame by the values of selected columns, we have dplyr::arrange() and ibis.order_by(). Both approaches arrange rows in ascending order by default and put missing values last. Again, the example below passes the columns to ibis.order_by() as a list.

penguins |> 
  arrange(island, desc(bill_length_mm))
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           59.6          17                 230        6050
 2 Gentoo  Biscoe           55.9          17                 228        5600
 3 Gentoo  Biscoe           55.1          16                 230        5850
 4 Gentoo  Biscoe           54.3          15.7               231        5650
 5 Gentoo  Biscoe           53.4          15.8               219        5500
 6 Gentoo  Biscoe           52.5          15.6               221        5450
 7 Gentoo  Biscoe           52.2          17.1               228        5400
 8 Gentoo  Biscoe           52.1          17                 230        5550
 9 Gentoo  Biscoe           51.5          16.3               230        5500
10 Gentoo  Biscoe           51.3          14.2               218        5300
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .order_by([_.island, _.bill_length_mm.desc()])
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string  │ string │ float64        │ float64       │ float64           │ … │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼───┤
│ Gentoo  │ Biscoe │           59.6 │          17.0 │             230.0 │ … │
│ Gentoo  │ Biscoe │           55.9 │          17.0 │             228.0 │ … │
│ Gentoo  │ Biscoe │           55.1 │          16.0 │             230.0 │ … │
│ Gentoo  │ Biscoe │           54.3 │          15.7 │             231.0 │ … │
│ Gentoo  │ Biscoe │           53.4 │          15.8 │             219.0 │ … │
│ Gentoo  │ Biscoe │           52.5 │          15.6 │             221.0 │ … │
│ Gentoo  │ Biscoe │           52.2 │          17.1 │             228.0 │ … │
│ Gentoo  │ Biscoe │           52.1 │          17.0 │             230.0 │ … │
│ Gentoo  │ Biscoe │           51.5 │          16.3 │             230.0 │ … │
│ Gentoo  │ Biscoe │           51.3 │          14.2 │             218.0 │ … │
│ …       │ …      │              … │             … │                 … │ … │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴───┘
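
The ordering used above (one column ascending, one descending, missing values last) can be sketched in plain Python with a sort key. The rows are made up for illustration and None stands in for a missing value.

```python
rows = [
    {"island": "Dream", "bill_length_mm": 39.5},
    {"island": "Biscoe", "bill_length_mm": None},
    {"island": "Biscoe", "bill_length_mm": 37.8},
    {"island": "Biscoe", "bill_length_mm": 59.6},
    {"island": "Dream", "bill_length_mm": 58.0},
]


def sort_key(row):
    value = row["bill_length_mm"]
    missing = value is None
    # island ascending; missing lengths last within an island;
    # non-missing lengths descending via the negated value
    return (row["island"], missing, 0.0 if missing else -value)


ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row["island"], row["bill_length_mm"])
```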

Work with columns

Select columns

Selecting a subset of columns works essentially the same in both packages, and dplyr::select() and ibis.select() even share a name. Note that you don’t have to use ibis._ here: you can also pass the column names as strings to the ibis.select() method.

penguins |> 
  select(bill_length_mm, sex)
# A tibble: 344 × 2
   bill_length_mm sex   
            <dbl> <fct> 
 1           39.1 male  
 2           39.5 female
 3           40.3 female
 4           NA   <NA>  
 5           36.7 female
 6           39.3 male  
 7           38.9 female
 8           39.2 male  
 9           34.1 <NA>  
10           42   <NA>  
# ℹ 334 more rows
(penguins
  .select(_.bill_length_mm, _.sex)
)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ bill_length_mm ┃ sex    ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ float64        │ string │
├────────────────┼────────┤
│           39.1 │ male   │
│           39.5 │ female │
│           40.3 │ female │
│           NULL │ NULL   │
│           36.7 │ female │
│           39.3 │ male   │
│           38.9 │ female │
│           39.2 │ male   │
│           34.1 │ NULL   │
│           42.0 │ NULL   │
│              … │ …      │
└────────────────┴────────┘

Rename columns

Renaming columns also works very similarly. The major difference is that ibis.rename() does not accept the column selector ibis._ on the right-hand side and expects the existing column names as strings, while dplyr::rename() takes variable names via the usual NSE.

penguins |> 
  rename(bill_length = bill_length_mm,
         bill_depth = bill_depth_mm)
# A tibble: 344 × 8
   species island    bill_length bill_depth flipper_length_mm body_mass_g sex   
   <fct>   <fct>           <dbl>      <dbl>             <int>       <int> <fct> 
 1 Adelie  Torgersen        39.1       18.7               181        3750 male  
 2 Adelie  Torgersen        39.5       17.4               186        3800 female
 3 Adelie  Torgersen        40.3       18                 195        3250 female
 4 Adelie  Torgersen        NA         NA                  NA          NA <NA>  
 5 Adelie  Torgersen        36.7       19.3               193        3450 female
 6 Adelie  Torgersen        39.3       20.6               190        3650 male  
 7 Adelie  Torgersen        38.9       17.8               181        3625 female
 8 Adelie  Torgersen        39.2       19.6               195        4675 male  
 9 Adelie  Torgersen        34.1       18.1               193        3475 <NA>  
10 Adelie  Torgersen        42         20.2               190        4250 <NA>  
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
(penguins
  .rename(bill_length = "bill_length_mm", 
          bill_depth = "bill_depth_mm")
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ species ┃ island    ┃ bill_length ┃ bill_depth ┃ flipper_length_mm ┃ … ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string  │ string    │ float64     │ float64    │ float64           │ … │
├─────────┼───────────┼─────────────┼────────────┼───────────────────┼───┤
│ Adelie  │ Torgersen │        39.1 │       18.7 │             181.0 │ … │
│ Adelie  │ Torgersen │        39.5 │       17.4 │             186.0 │ … │
│ Adelie  │ Torgersen │        40.3 │       18.0 │             195.0 │ … │
│ Adelie  │ Torgersen │        NULL │       NULL │              NULL │ … │
│ Adelie  │ Torgersen │        36.7 │       19.3 │             193.0 │ … │
│ Adelie  │ Torgersen │        39.3 │       20.6 │             190.0 │ … │
│ Adelie  │ Torgersen │        38.9 │       17.8 │             181.0 │ … │
│ Adelie  │ Torgersen │        39.2 │       19.6 │             195.0 │ … │
│ Adelie  │ Torgersen │        34.1 │       18.1 │             193.0 │ … │
│ Adelie  │ Torgersen │        42.0 │       20.2 │             190.0 │ … │
│ …       │ …         │           … │          … │                 … │ … │
└─────────┴───────────┴─────────────┴────────────┴───────────────────┴───┘

Mutate columns

Transforming existing columns or creating new ones is an essential part of data analysis. dplyr::mutate() and ibis.mutate() are the workhorses for these tasks. A big difference between the two is that in ibis you have to chain separate mutate calls when you want to reference a newly created column, whereas in dplyr you can create a column and use it in the same call.

penguins |> 
  mutate(ones = 1,
         bill_length = bill_length_mm / 10,
         bill_length_squared = bill_length^2) |> 
  select(ones, bill_length_mm, bill_length, bill_length_squared)
# A tibble: 344 × 4
    ones bill_length_mm bill_length bill_length_squared
   <dbl>          <dbl>       <dbl>               <dbl>
 1     1           39.1        3.91                15.3
 2     1           39.5        3.95                15.6
 3     1           40.3        4.03                16.2
 4     1           NA         NA                   NA  
 5     1           36.7        3.67                13.5
 6     1           39.3        3.93                15.4
 7     1           38.9        3.89                15.1
 8     1           39.2        3.92                15.4
 9     1           34.1        3.41                11.6
10     1           42          4.2                 17.6
# ℹ 334 more rows
(penguins 
  .mutate(ones = 1, 
          bill_length = _.bill_length_mm / 10)
  .mutate(bill_length_squared = _.bill_length**2)
  .select(_.ones, _.bill_length_mm, _.bill_length, _.bill_length_squared)
)
┏━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ones ┃ bill_length_mm ┃ bill_length ┃ bill_length_squared ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int8 │ float64        │ float64     │ float64             │
├──────┼────────────────┼─────────────┼─────────────────────┤
│    1 │           39.1 │        3.91 │             15.2881 │
│    1 │           39.5 │        3.95 │             15.6025 │
│    1 │           40.3 │        4.03 │             16.2409 │
│    1 │           NULL │        NULL │                NULL │
│    1 │           36.7 │        3.67 │             13.4689 │
│    1 │           39.3 │        3.93 │             15.4449 │
│    1 │           38.9 │        3.89 │             15.1321 │
│    1 │           39.2 │        3.92 │             15.3664 │
│    1 │           34.1 │        3.41 │             11.6281 │
│    1 │           42.0 │        4.20 │             17.6400 │
│    … │              … │           … │                   … │
└──────┴────────────────┴─────────────┴─────────────────────┘
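
One way to think about why the mutate calls have to be chained: every expression in a single call is resolved against the input table, so a column created in that call is not visible to its siblings. The following toy mutate() in plain Python mimics this behavior under that assumption. It is my own illustration, not the ibis implementation.

```python
def mutate(rows, **new_columns):
    """Toy mutate: every expression is evaluated against the input row,
    so columns created in the same call cannot see each other."""
    return [
        {**row, **{name: fn(row) for name, fn in new_columns.items()}}
        for row in rows
    ]


rows = [{"bill_length_mm": 39.1}, {"bill_length_mm": 42.0}]

# Works: the second call sees the column created by the first call.
step1 = mutate(rows, bill_length=lambda r: r["bill_length_mm"] / 10)
step2 = mutate(step1, bill_length_squared=lambda r: r["bill_length"] ** 2)

# Fails: bill_length does not exist yet on the input rows.
try:
    mutate(
        rows,
        bill_length=lambda r: r["bill_length_mm"] / 10,
        bill_length_squared=lambda r: r["bill_length"] ** 2,
    )
    single_call_ok = True
except KeyError:
    single_call_ok = False

print(single_call_ok)  # False
```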

Relocate columns

dplyr::relocate() provides options to change the positions of columns in a data frame, using the same syntax as dplyr::select(). In addition, there are the options .after and .before to provide users with additional shortcuts.

The recommended way to relocate columns in ibis is to use the ibis.select() method, but there are no options as in dplyr::relocate(). In fact, the safest way to consistently get the correct order of columns is to explicitly specify them.

penguins |> 
  relocate(c(species, bill_length_mm), .before = sex)
# A tibble: 344 × 8
   island    bill_depth_mm flipper_length_mm body_mass_g species bill_length_mm
   <fct>             <dbl>             <int>       <int> <fct>            <dbl>
 1 Torgersen          18.7               181        3750 Adelie            39.1
 2 Torgersen          17.4               186        3800 Adelie            39.5
 3 Torgersen          18                 195        3250 Adelie            40.3
 4 Torgersen          NA                  NA          NA Adelie            NA  
 5 Torgersen          19.3               193        3450 Adelie            36.7
 6 Torgersen          20.6               190        3650 Adelie            39.3
 7 Torgersen          17.8               181        3625 Adelie            38.9
 8 Torgersen          19.6               195        4675 Adelie            39.2
 9 Torgersen          18.1               193        3475 Adelie            34.1
10 Torgersen          20.2               190        4250 Adelie            42  
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .select(_.island, _.bill_depth_mm, _.flipper_length_mm, _.body_mass_g, 
          _.species, _.bill_length_mm, _.sex, _.year)
)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━┓
┃ island    ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ species ┃ … ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━┩
│ string    │ float64       │ float64           │ float64     │ string  │ … │
├───────────┼───────────────┼───────────────────┼─────────────┼─────────┼───┤
│ Torgersen │          18.7 │             181.0 │      3750.0 │ Adelie  │ … │
│ Torgersen │          17.4 │             186.0 │      3800.0 │ Adelie  │ … │
│ Torgersen │          18.0 │             195.0 │      3250.0 │ Adelie  │ … │
│ Torgersen │          NULL │              NULL │        NULL │ Adelie  │ … │
│ Torgersen │          19.3 │             193.0 │      3450.0 │ Adelie  │ … │
│ Torgersen │          20.6 │             190.0 │      3650.0 │ Adelie  │ … │
│ Torgersen │          17.8 │             181.0 │      3625.0 │ Adelie  │ … │
│ Torgersen │          19.6 │             195.0 │      4675.0 │ Adelie  │ … │
│ Torgersen │          18.1 │             193.0 │      3475.0 │ Adelie  │ … │
│ Torgersen │          20.2 │             190.0 │      4250.0 │ Adelie  │ … │
│ …         │             … │                 … │           … │ …       │ … │
└───────────┴───────────────┴───────────────────┴─────────────┴─────────┴───┘
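
If you don’t want to type out the full column order by hand, you can compute it first. The helper relocate_before() below is a hypothetical plain-Python function (not part of ibis) that mimics the .before option of dplyr::relocate() on a list of column names.

```python
def relocate_before(columns, to_move, before):
    """Return `columns` reordered so that `to_move` sit directly before `before`."""
    if before in to_move:
        raise ValueError("`before` must not be one of the moved columns")
    # Drop the moved columns, then re-insert them at the anchor position.
    remaining = [c for c in columns if c not in to_move]
    position = remaining.index(before)
    return remaining[:position] + list(to_move) + remaining[position:]


columns = ["species", "island", "bill_length_mm", "bill_depth_mm",
           "flipper_length_mm", "body_mass_g", "sex", "year"]

new_order = relocate_before(columns, ["species", "bill_length_mm"], before="sex")
print(new_order)
```

The resulting list of names matches the column order of the dplyr output above and could then be passed as strings to the select method.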

Work with groups of rows

Simple summaries by group

Let’s suppose we want to compute summaries by groups such as means or medians. Both packages are very similar again: on the R side you have dplyr::group_by() and dplyr::summarize(), while on the Python side you have ibis.group_by() and ibis.aggregate().

Note that the dplyr result is automatically sorted by the grouping variable, so to reproduce the results of dplyr, we need to add ibis.order_by() to the chain.

penguins |> 
  group_by(island) |> 
  summarize(bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
  island    bill_depth_mean
  <fct>               <dbl>
1 Biscoe               15.9
2 Dream                18.3
3 Torgersen            18.4
(penguins
  .group_by("island")
  .aggregate(bill_depth_mean = _.bill_depth_mm.mean())
  .order_by("island")
)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ island    ┃ bill_depth_mean ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ string    │ float64         │
├───────────┼─────────────────┤
│ Biscoe    │       15.874850 │
│ Dream     │       18.344355 │
│ Torgersen │       18.429412 │
└───────────┴─────────────────┘

More complicated summaries by group

Typically, you want to create multiple different summaries by groups. dplyr provides a lot of flexibility to create new variables on the fly, as does ibis. For instance, we can pass expressions to the mean functions in order to compute the share of female penguins per island in the summary statement.

penguins |> 
  group_by(island) |> 
  summarize(
    count = n(),
    bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE),
    flipper_length_median = median(flipper_length_mm, na.rm = TRUE),
    body_mass_sd = sd(body_mass_g, na.rm = TRUE),
    share_female = mean(sex == "female", na.rm = TRUE)
  )
# A tibble: 3 × 6
  island   count bill_depth_mean flipper_length_median body_mass_sd share_female
  <fct>    <int>           <dbl>                 <dbl>        <dbl>        <dbl>
1 Biscoe     168            15.9                   214         783.        0.491
2 Dream      124            18.3                   193         417.        0.496
3 Torgers…    52            18.4                   191         445.        0.511
(penguins
  .group_by("island")
  .aggregate(
    count = _.count(),
    bill_depth_mean = _.bill_depth_mm.mean(),
    flipper_length_median = _.flipper_length_mm.median(),
    body_mass_sd = _.body_mass_g.std(),
    share_female = (_.sex == "female").mean()
  )
  .order_by("island")
)
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━┓
┃ island    ┃ count ┃ bill_depth_mean ┃ flipper_length_medi… ┃ body_mass_sd ┃  ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━┩
│ string    │ int64 │ float64         │ float64              │ float64      │  │
├───────────┼───────┼─────────────────┼──────────────────────┼──────────────┼──┤
│ Biscoe    │   168 │       15.874850 │                214.0 │   782.855743 │  │
│ Dream     │   124 │       18.344355 │                193.0 │   416.644112 │  │
│ Torgersen │    52 │       18.429412 │                191.0 │   445.107940 │  │
└───────────┴───────┴─────────────────┴──────────────────────┴──────────────┴──┘
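
The share_female column works because the mean of a logical comparison is the fraction of rows for which it is true, computed over the non-missing values. A minimal sketch of that calculation in plain Python, with made-up values and None standing in for missing data:

```python
def share_equal(values, target):
    """Mean of (value == target) over the non-missing values (None = missing)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return None
    # True counts as 1 and False as 0, so the sum is the number of matches.
    return sum(v == target for v in observed) / len(observed)


sex = ["male", "female", "female", None, "female", "male", None]
share = share_equal(sex, "female")
print(share)  # 0.6
```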

Conclusion

This post highlights syntactic similarities and differences across R’s dplyr and Python’s ibis packages. Two key points emerge: (i) dplyr heavily relies on NSE to enable a syntax that refrains from using strings and column selectors, something that is not possible in Python; (ii) the syntax is remarkably similar across both packages. I want to close this post by emphasizing that both languages and packages have their own merits and I won’t strictly recommend one over the other. However, I definitely prefer the print output of dplyr to ibis because the latter is silent about additional columns of the underlying data. I’m a big fan of the concise data printing capabilities that are part of dplyr.

Footnotes

  1. See the unifying principles of the tidyverse: https://design.tidyverse.org/unifying.html.