Tidy Data Manipulation: dplyr vs TidierData

R
Julia
Manipulation
A comparison of R’s dplyr and Julia’s TidierData data manipulation packages
Author

Christoph Scheuch

Published

January 9, 2024

There are a myriad of options to perform essential data manipulation tasks in R and Julia. However, if we want to do tidy data science in R, there is a clear forerunner: dplyr. In the world of Julia, TidierData is a relatively new kid on the block that allows R users to dabble in Julia without learning a lot of new syntax. In this blog post, I illustrate their syntactic similarities and highlight differences between these two packages that emerge for a few key tasks.

Before we dive into the comparison, a short introduction to the packages: the dplyr package in R allows users to refer to columns without quotation marks due to its implementation of non-standard evaluation (NSE). NSE is a programming technique used in R that allows functions to capture the expressions passed to them as arguments, rather than just the values of those arguments. The primary goal of NSE in the context of dplyr is to create a more user-friendly and intuitive syntax. This makes data manipulation tasks more straightforward and aligns with the general philosophy of the tidyverse to make data science faster, easier, and more fun.1

TidierData is a 100% Julia implementation of the dplyr and tidyr R packages with three goals: (i) stick as closely to the tidyverse syntax as possible, so that R users find it easier to switch; (ii) make broadcasting2 mostly invisible, so that many functions are automatically vectorized for users; (iii) make scalars and tuples mostly interchangeable, so that users can provide a scalar or a tuple as arguments as they see fit. Check out the package website for more information, in particular with respect to the features of Julia.

Loading packages and data

We start by loading the main packages of interest and the popular palmerpenguins package that exists for both R and Julia. Note that packages in Julia follow a Pascal case convention, so we have TidierData and PalmerPenguins. We then use the penguins data frame as the data to compare all functions and methods below. Note that ENV["LINES"] = 19 sets the print output of Julia data frames to show only 10 rows.3

library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins
using TidierData
using PalmerPenguins

penguins = DataFrame(PalmerPenguins.load())
ENV["LINES"] = 19

Work with rows

Filter rows

Filtering rows with dplyr is based on NSE and the dplyr::filter() function. To replicate the same results with TidierData, you can use TidierData.@filter() method which accepts a remarkably similar notation to dplyr with the only exceptions that you need && or || for boolean operators and that you can omit the percentage signs around in.

penguins |> 
  filter(species == "Adelie" & 
           island %in% c("Biscoe", "Dream"))
# A tibble: 100 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           37.8          18.3               174        3400
 2 Adelie  Biscoe           37.7          18.7               180        3600
 3 Adelie  Biscoe           35.9          19.2               189        3800
 4 Adelie  Biscoe           38.2          18.1               185        3950
 5 Adelie  Biscoe           38.8          17.2               180        3800
 6 Adelie  Biscoe           35.3          18.9               187        3800
 7 Adelie  Biscoe           40.6          18.6               183        3550
 8 Adelie  Biscoe           40.5          17.9               187        3200
 9 Adelie  Biscoe           37.9          18.6               172        3150
10 Adelie  Biscoe           40.5          18.9               180        3950
# ℹ 90 more rows
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
  @filter(species == "Adelie" &&
            island in ("Biscoe", "Dream"))
end
100×7 DataFrame
 Row │ species   island    bill_length_mm  bill_depth_mm  flipper_length_mm  b ⋯
     │ String15  String15  Float64?        Float64?       Int64?             I ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Adelie    Biscoe              37.8           18.3                174    ⋯
   2 │ Adelie    Biscoe              37.7           18.7                180
   3 │ Adelie    Biscoe              35.9           19.2                189
   4 │ Adelie    Biscoe              38.2           18.1                185
   5 │ Adelie    Biscoe              38.8           17.2                180    ⋯
  ⋮  │    ⋮         ⋮            ⋮               ⋮                ⋮            ⋱
  96 │ Adelie    Dream               36.6           18.4                184
  97 │ Adelie    Dream               36.0           17.8                195
  98 │ Adelie    Dream               37.8           18.1                193
  99 │ Adelie    Dream               36.0           17.1                187    ⋯
 100 │ Adelie    Dream               41.5           18.5                201
                                                   2 columns and 90 rows omitted

Slice rows

dplyr::slice() takes integers with row numbers as inputs, so you can use ranges and arbitrary vectors of integers. TidierData.@slice() does exactly the same.

penguins |> 
  slice(10:20)
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           42            20.2               190        4250
 2 Adelie  Torgersen           37.8          17.1               186        3300
 3 Adelie  Torgersen           37.8          17.3               180        3700
 4 Adelie  Torgersen           41.1          17.6               182        3200
 5 Adelie  Torgersen           38.6          21.2               191        3800
 6 Adelie  Torgersen           34.6          21.1               198        4400
 7 Adelie  Torgersen           36.6          17.8               185        3700
 8 Adelie  Torgersen           38.7          19                 195        3450
 9 Adelie  Torgersen           42.5          20.7               197        4500
10 Adelie  Torgersen           34.4          18.4               184        3325
11 Adelie  Torgersen           46            21.5               194        4200
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
  @slice(10:20)
end
11×7 DataFrame
 Row │ species   island     bill_length_mm  bill_depth_mm  flipper_length_mm   ⋯
     │ String15  String15   Float64?        Float64?       Int64?              ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Adelie    Torgersen            42.0           20.2                190   ⋯
   2 │ Adelie    Torgersen            37.8           17.1                186
   3 │ Adelie    Torgersen            37.8           17.3                180
   4 │ Adelie    Torgersen            41.1           17.6                182
   5 │ Adelie    Torgersen            38.6           21.2                191   ⋯
   6 │ Adelie    Torgersen            34.6           21.1                198
   7 │ Adelie    Torgersen            36.6           17.8                185
   8 │ Adelie    Torgersen            38.7           19.0                195
   9 │ Adelie    Torgersen            42.5           20.7                197   ⋯
  10 │ Adelie    Torgersen            34.4           18.4                184
  11 │ Adelie    Torgersen            46.0           21.5                194
                                                               2 columns omitted

Arrange rows

To orders the rows of a data frame by the values of selected columns, we have dplyr::arrange() and TidierData.@arrange(). Note that both approaches arrange rows in an an ascending order and puts missing values last as defaults.

penguins |> 
  arrange(island, desc(bill_length_mm))
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           59.6          17                 230        6050
 2 Gentoo  Biscoe           55.9          17                 228        5600
 3 Gentoo  Biscoe           55.1          16                 230        5850
 4 Gentoo  Biscoe           54.3          15.7               231        5650
 5 Gentoo  Biscoe           53.4          15.8               219        5500
 6 Gentoo  Biscoe           52.5          15.6               221        5450
 7 Gentoo  Biscoe           52.2          17.1               228        5400
 8 Gentoo  Biscoe           52.1          17                 230        5550
 9 Gentoo  Biscoe           51.5          16.3               230        5500
10 Gentoo  Biscoe           51.3          14.2               218        5300
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
  @arrange(island, desc(bill_length_mm))
end
344×7 DataFrame
 Row │ species   island     bill_length_mm  bill_depth_mm  flipper_length_mm   ⋯
     │ String15  String15   Float64?        Float64?       Int64?              ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Gentoo    Biscoe          missing        missing              missing   ⋯
   2 │ Gentoo    Biscoe               59.6           17.0                230
   3 │ Gentoo    Biscoe               55.9           17.0                228
   4 │ Gentoo    Biscoe               55.1           16.0                230
   5 │ Gentoo    Biscoe               54.3           15.7                231   ⋯
  ⋮  │    ⋮          ⋮            ⋮               ⋮                ⋮           ⋱
 340 │ Adelie    Torgersen            34.6           21.1                198
 341 │ Adelie    Torgersen            34.6           17.2                189
 342 │ Adelie    Torgersen            34.4           18.4                184
 343 │ Adelie    Torgersen            34.1           18.1                193   ⋯
 344 │ Adelie    Torgersen            33.5           19.0                190
                                                  2 columns and 334 rows omitted

Work with columns

Select columns

Selecting a subset of columns works exactly the same withdplyr::select() and TidierData.@select().

penguins |> 
  select(bill_length_mm, sex)
# A tibble: 344 × 2
   bill_length_mm sex   
            <dbl> <fct> 
 1           39.1 male  
 2           39.5 female
 3           40.3 female
 4           NA   <NA>  
 5           36.7 female
 6           39.3 male  
 7           38.9 female
 8           39.2 male  
 9           34.1 <NA>  
10           42   <NA>  
# ℹ 334 more rows
@chain penguins begin
  @select(bill_length_mm, sex)
end
344×2 DataFrame
 Row │ bill_length_mm  sex
     │ Float64?        String7
─────┼─────────────────────────
   1 │           39.1  male
   2 │           39.5  female
   3 │           40.3  female
   4 │      missing    missing
   5 │           36.7  female
  ⋮  │       ⋮            ⋮
 340 │           55.8  male
 341 │           43.5  female
 342 │           49.6  male
 343 │           50.8  male
 344 │           50.2  female
               334 rows omitted

Rename columns

Renaming columns also works exactly the same with dplyr::rename() and TidierData.rename().

penguins |> 
  rename(bill_length = bill_length_mm,
         bill_depth = bill_depth_mm)
# A tibble: 344 × 8
   species island    bill_length bill_depth flipper_length_mm body_mass_g sex   
   <fct>   <fct>           <dbl>      <dbl>             <int>       <int> <fct> 
 1 Adelie  Torgersen        39.1       18.7               181        3750 male  
 2 Adelie  Torgersen        39.5       17.4               186        3800 female
 3 Adelie  Torgersen        40.3       18                 195        3250 female
 4 Adelie  Torgersen        NA         NA                  NA          NA <NA>  
 5 Adelie  Torgersen        36.7       19.3               193        3450 female
 6 Adelie  Torgersen        39.3       20.6               190        3650 male  
 7 Adelie  Torgersen        38.9       17.8               181        3625 female
 8 Adelie  Torgersen        39.2       19.6               195        4675 male  
 9 Adelie  Torgersen        34.1       18.1               193        3475 <NA>  
10 Adelie  Torgersen        42         20.2               190        4250 <NA>  
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
@chain penguins begin
  @rename(bill_length = bill_length_mm,
          bill_depth = bill_depth_mm)
end
344×7 DataFrame
 Row │ species    island     bill_length  bill_depth  flipper_length_mm  body_ ⋯
     │ String15   String15   Float64?     Float64?    Int64?             Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Adelie     Torgersen         39.1        18.7                181        ⋯
   2 │ Adelie     Torgersen         39.5        17.4                186
   3 │ Adelie     Torgersen         40.3        18.0                195
   4 │ Adelie     Torgersen    missing     missing              missing      m
   5 │ Adelie     Torgersen         36.7        19.3                193        ⋯
  ⋮  │     ⋮          ⋮           ⋮           ⋮               ⋮                ⋱
 340 │ Chinstrap  Dream             55.8        19.8                207
 341 │ Chinstrap  Dream             43.5        18.1                202
 342 │ Chinstrap  Dream             49.6        18.2                193
 343 │ Chinstrap  Dream             50.8        19.0                210        ⋯
 344 │ Chinstrap  Dream             50.2        18.7                198
                                                  2 columns and 334 rows omitted

Mutate columns

Transforming existing columns or creating new ones is an essential part of data analysis. dplyr::mutate() and TidierData.@mutate() are the work horses for these tasks. Note that you have to split up variable assignments if you want to refer to a newly created variable in TidierData, while you can refer to the new variables in the same mutate block in dplyr.

penguins |> 
  mutate(ones = 1,
         bill_length = bill_length_mm / 10,
         bill_length_squared = bill_length^2) |> 
  select(ones, bill_length_mm, bill_length, bill_length_squared)
# A tibble: 344 × 4
    ones bill_length_mm bill_length bill_length_squared
   <dbl>          <dbl>       <dbl>               <dbl>
 1     1           39.1        3.91                15.3
 2     1           39.5        3.95                15.6
 3     1           40.3        4.03                16.2
 4     1           NA         NA                   NA  
 5     1           36.7        3.67                13.5
 6     1           39.3        3.93                15.4
 7     1           38.9        3.89                15.1
 8     1           39.2        3.92                15.4
 9     1           34.1        3.41                11.6
10     1           42          4.2                 17.6
# ℹ 334 more rows
@chain penguins begin
  @mutate(ones = 1,
          bill_length = bill_length_mm / 10)
  @mutate(bill_length_squared = bill_length^2)
  @select(ones, bill_length_mm, bill_length, bill_length_squared)
end
344×4 DataFrame
 Row │ ones   bill_length_mm  bill_length  bill_length_squared
     │ Int64  Float64?        Float64?     Float64?
─────┼─────────────────────────────────────────────────────────
   1 │     1            39.1         3.91              15.2881
   2 │     1            39.5         3.95              15.6025
   3 │     1            40.3         4.03              16.2409
   4 │     1       missing     missing            missing
   5 │     1            36.7         3.67              13.4689
  ⋮  │   ⋮          ⋮              ⋮                ⋮
 340 │     1            55.8         5.58              31.1364
 341 │     1            43.5         4.35              18.9225
 342 │     1            49.6         4.96              24.6016
 343 │     1            50.8         5.08              25.8064
 344 │     1            50.2         5.02              25.2004
                                               334 rows omitted

Relocate columns

dplyr::relocate() provides options to change the positions of columns in a data frame, using the same syntax as dplyr::select(). In addition, there are the options .after and .before to provide users with additional shortcuts.

The recommended way to relocate columns in TidierData is to use the TidierData.@select() method, but there are no options as in dplyr::relocate(). In fact, the safest way to consistently get the correct order of columns is to explicitly specify them.

penguins |> 
  relocate(c(species, bill_length_mm), .before = sex)
# A tibble: 344 × 8
   island    bill_depth_mm flipper_length_mm body_mass_g species bill_length_mm
   <fct>             <dbl>             <int>       <int> <fct>            <dbl>
 1 Torgersen          18.7               181        3750 Adelie            39.1
 2 Torgersen          17.4               186        3800 Adelie            39.5
 3 Torgersen          18                 195        3250 Adelie            40.3
 4 Torgersen          NA                  NA          NA Adelie            NA  
 5 Torgersen          19.3               193        3450 Adelie            36.7
 6 Torgersen          20.6               190        3650 Adelie            39.3
 7 Torgersen          17.8               181        3625 Adelie            38.9
 8 Torgersen          19.6               195        4675 Adelie            39.2
 9 Torgersen          18.1               193        3475 Adelie            34.1
10 Torgersen          20.2               190        4250 Adelie            42  
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
  @select(island, bill_depth_mm, flipper_length_mm, body_mass_g, 
          species, bill_length_mm, sex)
end
344×7 DataFrame
 Row │ island     bill_depth_mm  flipper_length_mm  body_mass_g  species    bi ⋯
     │ String15   Float64?       Int64?             Int64?       String15   Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Torgersen           18.7                181         3750  Adelie        ⋯
   2 │ Torgersen           17.4                186         3800  Adelie
   3 │ Torgersen           18.0                195         3250  Adelie
   4 │ Torgersen      missing              missing      missing  Adelie
   5 │ Torgersen           19.3                193         3450  Adelie        ⋯
  ⋮  │     ⋮            ⋮                ⋮               ⋮           ⋮         ⋱
 340 │ Dream               19.8                207         4000  Chinstrap
 341 │ Dream               18.1                202         3400  Chinstrap
 342 │ Dream               18.2                193         3775  Chinstrap
 343 │ Dream               19.0                210         4100  Chinstrap     ⋯
 344 │ Dream               18.7                198         3775  Chinstrap
                                                  2 columns and 334 rows omitted

Work with groups of rows

Simple summaries by group

Let’s suppose we want to compute summaries by groups such as means or medians. Both packages are virtually the same again: on the R side you have dplyr::group_by() and dplyr::summarize(), while on the Julia side you have TidierData.@group_by() and TidierData.@summarize(). Note that you have to include the skipmissing() wrapper in order to drop missing values in the mean() function.

Moreover, dplyr also automatically arranges the results by the group, so the reproduce the results of dplyr, we need to add TidierData.@arrange() to the chain.

penguins |> 
  group_by(island) |> 
  summarize(bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
  island    bill_depth_mean
  <fct>               <dbl>
1 Biscoe               15.9
2 Dream                18.3
3 Torgersen            18.4
@chain penguins begin
  @group_by(island) 
  @summarize(bill_depth_mean = mean(skipmissing(bill_depth_mm)))
  @arrange(island)
end
3×2 DataFrame
 Row │ island     bill_depth_mean
     │ String15   Float64
─────┼────────────────────────────
   1 │ Biscoe             15.8749
   2 │ Dream              18.3444
   3 │ Torgersen          18.4294

More complicated summaries by group

Typically, you want to create multiple different summaries by groups. dplyr provides a lot of flexibility to create new variables on the fly, as does TidierData. For instance, we can pass expressions to them mean functions in order to create the share of female penguins per island in the summary statement.

penguins |> 
  group_by(island) |> 
  summarize(count = n(),
            bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE),
            flipper_length_median = median(flipper_length_mm, na.rm = TRUE),
            body_mass_sd = sd(body_mass_g, na.rm = TRUE),
            share_female = mean(sex == "female", na.rm = TRUE))
# A tibble: 3 × 6
  island   count bill_depth_mean flipper_length_median body_mass_sd share_female
  <fct>    <int>           <dbl>                 <dbl>        <dbl>        <dbl>
1 Biscoe     168            15.9                   214         783.        0.491
2 Dream      124            18.3                   193         417.        0.496
3 Torgers…    52            18.4                   191         445.        0.511
@chain penguins begin
  @group_by(island) 
  @summarize(count = n(),
             bill_depth_mean = mean(skipmissing(bill_depth_mm)),
             flipper_length_median = median(skipmissing(flipper_length_mm)),
             body_mass_sd = std(skipmissing(body_mass_g)),
             share_female = mean(skipmissing(sex == "female")))
  @arrange(island)
end
3×6 DataFrame
 Row │ island     count  bill_depth_mean  flipper_length_median  body_mass_sd  ⋯
     │ String15   Int64  Float64          Float64                Float64       ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Biscoe       168          15.8749                  214.0       782.856  ⋯
   2 │ Dream        124          18.3444                  193.0       416.644
   3 │ Torgersen     52          18.4294                  191.0       445.108
                                                                1 column omitted

Conclusion

This post highlights syntactic similarities and differences across R’s dplyr and Julia’s TidierData packages. The key difference is between pipes and chains: dplyr uses the pipe operator |> to chain functions, while TidierData uses the @chain df begin ... end syntax for piping a value through a series of transformation expressions. Nonetheless, the similarities are remarkable and demonstrate the flexibility of Julia to seemingly replicate the NSE capabilities of R. If you want to play around with Julia or some of its packages, I can highly recommend to take a shortcut using TidierData.

Footnotes

  1. See the unifying principles of the tidyverse: https://design.tidyverse.org/unifying.html.↩︎

  2. Broadcasting expands singleton dimensions in array arguments to match the corresponding dimension in the other array without using extra memory.↩︎

  3. In Julia, setting the number of rows to display for all DataFrames globally isn’t straightforward as there isn’t a direct global setting for this in the DataFrame package. ENV["COLUMNS"] or ENV["LINES"] control the display based on the size of your terminal.↩︎