library(dplyr)
library(palmerpenguins)
<- palmerpenguins::penguins penguins
Tidy Data Manipulation: dplyr vs TidierData
There are a myriad of options to perform essential data manipulation tasks in R and Julia. However, if we want to do tidy data science in R, there is a clear forerunner: dplyr
. In the world of Julia, TidierData
is a relatively new kid on the block that allows R users to dabble in Julia without learning a lot of new syntax. In this blog post, I illustrate their syntactic similarities and highlight differences between these two packages that emerge for a few key tasks.
Before we dive into the comparison, a short introduction to the packages: the dplyr
package in R allows users to refer to columns without quotation marks due to its implementation of non-standard evaluation (NSE). NSE is a programming technique used in R that allows functions to capture the expressions passed to them as arguments, rather than just the values of those arguments. The primary goal of NSE in the context of dplyr
is to create a more user-friendly and intuitive syntax. This makes data manipulation tasks more straightforward and aligns with the general philosophy of the tidyverse
to make data science faster, easier, and more fun.1
TidierData
is a 100% Julia implementation of the dplyr
and tidyr
R packages with three goals: (i) stick as closely to the tidyverse
syntax as possible, so that R users find it easier to switch; (ii) make broadcasting2 mostly invisible, so that many functions are automatically vectorized for users; (iii) make scalars and tuples mostly interchangeable, so that users can provide a scalar or a tuple as arguments as they see fit. Check out the package website for more information, in particular with respect to the features of Julia.
Loading packages and data
We start by loading the main packages of interest and the popular palmerpenguins
package that exists for both R and Julia. Note that packages in Julia follow a Pascal case convention, so we have TidierData
and PalmerPenguins
. We then use the penguins
data frame as the data to compare all functions and methods below. Note that ENV["LINES"] = 19
sets the print output of Julia data frames to show only 10 rows.3
using TidierData
using PalmerPenguins
= DataFrame(PalmerPenguins.load())
penguins ENV["LINES"] = 19
Work with rows
Filter rows
Filtering rows with dplyr
is based on NSE and the dplyr::filter()
function. To replicate the same results with TidierData
, you can use TidierData.@filter()
method which accepts a remarkably similar notation to dplyr
with the only exceptions that you need &&
or ||
for boolean operators and that you can omit the percentage signs around in
.
|>
penguins filter(species == "Adelie" &
%in% c("Biscoe", "Dream")) island
# A tibble: 100 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 37.8 18.3 174 3400
2 Adelie Biscoe 37.7 18.7 180 3600
3 Adelie Biscoe 35.9 19.2 189 3800
4 Adelie Biscoe 38.2 18.1 185 3950
5 Adelie Biscoe 38.8 17.2 180 3800
6 Adelie Biscoe 35.3 18.9 187 3800
7 Adelie Biscoe 40.6 18.6 183 3550
8 Adelie Biscoe 40.5 17.9 187 3200
9 Adelie Biscoe 37.9 18.6 172 3150
10 Adelie Biscoe 40.5 18.9 180 3950
# ℹ 90 more rows
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
@filter(species == "Adelie" &&
in ("Biscoe", "Dream"))
island end
100×7 DataFrame
Row │ species island bill_length_mm bill_depth_mm flipper_length_mm b ⋯
│ String15 String15 Float64? Float64? Int64? I ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Adelie Biscoe 37.8 18.3 174 ⋯
2 │ Adelie Biscoe 37.7 18.7 180
3 │ Adelie Biscoe 35.9 19.2 189
4 │ Adelie Biscoe 38.2 18.1 185
5 │ Adelie Biscoe 38.8 17.2 180 ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
96 │ Adelie Dream 36.6 18.4 184
97 │ Adelie Dream 36.0 17.8 195
98 │ Adelie Dream 37.8 18.1 193
99 │ Adelie Dream 36.0 17.1 187 ⋯
100 │ Adelie Dream 41.5 18.5 201
2 columns and 90 rows omitted
Slice rows
dplyr::slice()
takes integers with row numbers as inputs, so you can use ranges and arbitrary vectors of integers. TidierData.@slice()
does exactly the same.
|>
penguins slice(10:20)
# A tibble: 11 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 42 20.2 190 4250
2 Adelie Torgersen 37.8 17.1 186 3300
3 Adelie Torgersen 37.8 17.3 180 3700
4 Adelie Torgersen 41.1 17.6 182 3200
5 Adelie Torgersen 38.6 21.2 191 3800
6 Adelie Torgersen 34.6 21.1 198 4400
7 Adelie Torgersen 36.6 17.8 185 3700
8 Adelie Torgersen 38.7 19 195 3450
9 Adelie Torgersen 42.5 20.7 197 4500
10 Adelie Torgersen 34.4 18.4 184 3325
11 Adelie Torgersen 46 21.5 194 4200
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
@slice(10:20)
end
11×7 DataFrame
Row │ species island bill_length_mm bill_depth_mm flipper_length_mm ⋯
│ String15 String15 Float64? Float64? Int64? ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Adelie Torgersen 42.0 20.2 190 ⋯
2 │ Adelie Torgersen 37.8 17.1 186
3 │ Adelie Torgersen 37.8 17.3 180
4 │ Adelie Torgersen 41.1 17.6 182
5 │ Adelie Torgersen 38.6 21.2 191 ⋯
6 │ Adelie Torgersen 34.6 21.1 198
7 │ Adelie Torgersen 36.6 17.8 185
8 │ Adelie Torgersen 38.7 19.0 195
9 │ Adelie Torgersen 42.5 20.7 197 ⋯
10 │ Adelie Torgersen 34.4 18.4 184
11 │ Adelie Torgersen 46.0 21.5 194
2 columns omitted
Arrange rows
To orders the rows of a data frame by the values of selected columns, we have dplyr::arrange()
and TidierData.@arrange()
. Note that both approaches arrange rows in an an ascending order and puts missing values last as defaults.
|>
penguins arrange(island, desc(bill_length_mm))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Gentoo Biscoe 55.9 17 228 5600
3 Gentoo Biscoe 55.1 16 230 5850
4 Gentoo Biscoe 54.3 15.7 231 5650
5 Gentoo Biscoe 53.4 15.8 219 5500
6 Gentoo Biscoe 52.5 15.6 221 5450
7 Gentoo Biscoe 52.2 17.1 228 5400
8 Gentoo Biscoe 52.1 17 230 5550
9 Gentoo Biscoe 51.5 16.3 230 5500
10 Gentoo Biscoe 51.3 14.2 218 5300
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
@arrange(island, desc(bill_length_mm))
end
344×7 DataFrame
Row │ species island bill_length_mm bill_depth_mm flipper_length_mm ⋯
│ String15 String15 Float64? Float64? Int64? ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Gentoo Biscoe missing missing missing ⋯
2 │ Gentoo Biscoe 59.6 17.0 230
3 │ Gentoo Biscoe 55.9 17.0 228
4 │ Gentoo Biscoe 55.1 16.0 230
5 │ Gentoo Biscoe 54.3 15.7 231 ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
340 │ Adelie Torgersen 34.6 21.1 198
341 │ Adelie Torgersen 34.6 17.2 189
342 │ Adelie Torgersen 34.4 18.4 184
343 │ Adelie Torgersen 34.1 18.1 193 ⋯
344 │ Adelie Torgersen 33.5 19.0 190
2 columns and 334 rows omitted
Work with columns
Select columns
Selecting a subset of columns works exactly the same withdplyr::select()
and TidierData.@select()
.
|>
penguins select(bill_length_mm, sex)
# A tibble: 344 × 2
bill_length_mm sex
<dbl> <fct>
1 39.1 male
2 39.5 female
3 40.3 female
4 NA <NA>
5 36.7 female
6 39.3 male
7 38.9 female
8 39.2 male
9 34.1 <NA>
10 42 <NA>
# ℹ 334 more rows
@chain penguins begin
@select(bill_length_mm, sex)
end
344×2 DataFrame
Row │ bill_length_mm sex
│ Float64? String7
─────┼─────────────────────────
1 │ 39.1 male
2 │ 39.5 female
3 │ 40.3 female
4 │ missing missing
5 │ 36.7 female
⋮ │ ⋮ ⋮
340 │ 55.8 male
341 │ 43.5 female
342 │ 49.6 male
343 │ 50.8 male
344 │ 50.2 female
334 rows omitted
Rename columns
Renaming columns also works exactly the same with dplyr::rename()
and TidierData.rename()
.
|>
penguins rename(bill_length = bill_length_mm,
bill_depth = bill_depth_mm)
# A tibble: 344 × 8
species island bill_length bill_depth flipper_length_mm body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 female
3 Adelie Torgersen 40.3 18 195 3250 female
4 Adelie Torgersen NA NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 193 3450 female
6 Adelie Torgersen 39.3 20.6 190 3650 male
7 Adelie Torgersen 38.9 17.8 181 3625 female
8 Adelie Torgersen 39.2 19.6 195 4675 male
9 Adelie Torgersen 34.1 18.1 193 3475 <NA>
10 Adelie Torgersen 42 20.2 190 4250 <NA>
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
@chain penguins begin
@rename(bill_length = bill_length_mm,
= bill_depth_mm)
bill_depth end
344×7 DataFrame
Row │ species island bill_length bill_depth flipper_length_mm body_ ⋯
│ String15 String15 Float64? Float64? Int64? Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Adelie Torgersen 39.1 18.7 181 ⋯
2 │ Adelie Torgersen 39.5 17.4 186
3 │ Adelie Torgersen 40.3 18.0 195
4 │ Adelie Torgersen missing missing missing m
5 │ Adelie Torgersen 36.7 19.3 193 ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
340 │ Chinstrap Dream 55.8 19.8 207
341 │ Chinstrap Dream 43.5 18.1 202
342 │ Chinstrap Dream 49.6 18.2 193
343 │ Chinstrap Dream 50.8 19.0 210 ⋯
344 │ Chinstrap Dream 50.2 18.7 198
2 columns and 334 rows omitted
Mutate columns
Transforming existing columns or creating new ones is an essential part of data analysis. dplyr::mutate()
and TidierData.@mutate()
are the work horses for these tasks. Note that you have to split up variable assignments if you want to refer to a newly created variable in TidierData
, while you can refer to the new variables in the same mutate block in dplyr
.
|>
penguins mutate(ones = 1,
bill_length = bill_length_mm / 10,
bill_length_squared = bill_length^2) |>
select(ones, bill_length_mm, bill_length, bill_length_squared)
# A tibble: 344 × 4
ones bill_length_mm bill_length bill_length_squared
<dbl> <dbl> <dbl> <dbl>
1 1 39.1 3.91 15.3
2 1 39.5 3.95 15.6
3 1 40.3 4.03 16.2
4 1 NA NA NA
5 1 36.7 3.67 13.5
6 1 39.3 3.93 15.4
7 1 38.9 3.89 15.1
8 1 39.2 3.92 15.4
9 1 34.1 3.41 11.6
10 1 42 4.2 17.6
# ℹ 334 more rows
@chain penguins begin
@mutate(ones = 1,
= bill_length_mm / 10)
bill_length @mutate(bill_length_squared = bill_length^2)
@select(ones, bill_length_mm, bill_length, bill_length_squared)
end
344×4 DataFrame
Row │ ones bill_length_mm bill_length bill_length_squared
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────
1 │ 1 39.1 3.91 15.2881
2 │ 1 39.5 3.95 15.6025
3 │ 1 40.3 4.03 16.2409
4 │ 1 missing missing missing
5 │ 1 36.7 3.67 13.4689
⋮ │ ⋮ ⋮ ⋮ ⋮
340 │ 1 55.8 5.58 31.1364
341 │ 1 43.5 4.35 18.9225
342 │ 1 49.6 4.96 24.6016
343 │ 1 50.8 5.08 25.8064
344 │ 1 50.2 5.02 25.2004
334 rows omitted
Relocate columns
dplyr::relocate()
provides options to change the positions of columns in a data frame, using the same syntax as dplyr::select()
. In addition, there are the options .after
and .before
to provide users with additional shortcuts.
The recommended way to relocate columns in TidierData
is to use the TidierData.@select()
method, but there are no options as in dplyr::relocate()
. In fact, the safest way to consistently get the correct order of columns is to explicitly specify them.
|>
penguins relocate(c(species, bill_length_mm), .before = sex)
# A tibble: 344 × 8
island bill_depth_mm flipper_length_mm body_mass_g species bill_length_mm
<fct> <dbl> <int> <int> <fct> <dbl>
1 Torgersen 18.7 181 3750 Adelie 39.1
2 Torgersen 17.4 186 3800 Adelie 39.5
3 Torgersen 18 195 3250 Adelie 40.3
4 Torgersen NA NA NA Adelie NA
5 Torgersen 19.3 193 3450 Adelie 36.7
6 Torgersen 20.6 190 3650 Adelie 39.3
7 Torgersen 17.8 181 3625 Adelie 38.9
8 Torgersen 19.6 195 4675 Adelie 39.2
9 Torgersen 18.1 193 3475 Adelie 34.1
10 Torgersen 20.2 190 4250 Adelie 42
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
@chain penguins begin
@select(island, bill_depth_mm, flipper_length_mm, body_mass_g,
species, bill_length_mm, sex)end
344×7 DataFrame
Row │ island bill_depth_mm flipper_length_mm body_mass_g species bi ⋯
│ String15 Float64? Int64? Int64? String15 Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Torgersen 18.7 181 3750 Adelie ⋯
2 │ Torgersen 17.4 186 3800 Adelie
3 │ Torgersen 18.0 195 3250 Adelie
4 │ Torgersen missing missing missing Adelie
5 │ Torgersen 19.3 193 3450 Adelie ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
340 │ Dream 19.8 207 4000 Chinstrap
341 │ Dream 18.1 202 3400 Chinstrap
342 │ Dream 18.2 193 3775 Chinstrap
343 │ Dream 19.0 210 4100 Chinstrap ⋯
344 │ Dream 18.7 198 3775 Chinstrap
2 columns and 334 rows omitted
Work with groups of rows
Simple summaries by group
Let’s suppose we want to compute summaries by groups such as means or medians. Both packages are virtually the same again: on the R side you have dplyr::group_by()
and dplyr::summarize()
, while on the Julia side you have TidierData.@group_by()
and TidierData.@summarize()
. Note that you have to include the skipmissing()
wrapper in order to drop missing values in the mean()
function.
Moreover, dplyr
also automatically arranges the results by the group, so the reproduce the results of dplyr
, we need to add TidierData.@arrange()
to the chain.
|>
penguins group_by(island) |>
summarize(bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
island bill_depth_mean
<fct> <dbl>
1 Biscoe 15.9
2 Dream 18.3
3 Torgersen 18.4
@chain penguins begin
@group_by(island)
@summarize(bill_depth_mean = mean(skipmissing(bill_depth_mm)))
@arrange(island)
end
3×2 DataFrame
Row │ island bill_depth_mean
│ String15 Float64
─────┼────────────────────────────
1 │ Biscoe 15.8749
2 │ Dream 18.3444
3 │ Torgersen 18.4294
More complicated summaries by group
Typically, you want to create multiple different summaries by groups. dplyr
provides a lot of flexibility to create new variables on the fly, as does TidierData
. For instance, we can pass expressions to them mean functions in order to create the share of female penguins per island in the summary statement.
|>
penguins group_by(island) |>
summarize(count = n(),
bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE),
flipper_length_median = median(flipper_length_mm, na.rm = TRUE),
body_mass_sd = sd(body_mass_g, na.rm = TRUE),
share_female = mean(sex == "female", na.rm = TRUE))
# A tibble: 3 × 6
island count bill_depth_mean flipper_length_median body_mass_sd share_female
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Biscoe 168 15.9 214 783. 0.491
2 Dream 124 18.3 193 417. 0.496
3 Torgers… 52 18.4 191 445. 0.511
@chain penguins begin
@group_by(island)
@summarize(count = n(),
= mean(skipmissing(bill_depth_mm)),
bill_depth_mean = median(skipmissing(flipper_length_mm)),
flipper_length_median = std(skipmissing(body_mass_g)),
body_mass_sd = mean(skipmissing(sex == "female")))
share_female @arrange(island)
end
3×6 DataFrame
Row │ island count bill_depth_mean flipper_length_median body_mass_sd ⋯
│ String15 Int64 Float64 Float64 Float64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Biscoe 168 15.8749 214.0 782.856 ⋯
2 │ Dream 124 18.3444 193.0 416.644
3 │ Torgersen 52 18.4294 191.0 445.108
1 column omitted
Conclusion
This post highlights syntactic similarities and differences across R’s dplyr
and Julia’s TidierData
packages. The key difference is between pipes and chains: dplyr
uses the pipe operator |>
to chain functions, while TidierData
uses the @chain df begin ... end
syntax for piping a value through a series of transformation expressions. Nonetheless, the similarities are remarkable and demonstrate the flexibility of Julia to seemingly replicate the NSE capabilities of R. If you want to play around with Julia or some of its packages, I can highly recommend to take a shortcut using TidierData
.
Footnotes
See the unifying principles of the tidyverse: https://design.tidyverse.org/unifying.html.↩︎
Broadcasting expands singleton dimensions in array arguments to match the corresponding dimension in the other array without using extra memory.↩︎
In Julia, setting the number of rows to display for all DataFrames globally isn’t straightforward as there isn’t a direct global setting for this in the
DataFrame
package.ENV["COLUMNS"]
orENV["LINES"]
control the display based on the size of your terminal.↩︎