Tidy Data Manipulation: dplyr vs pandas

R
Python
Manipulation
A comparison of R’s dplyr and Python’s pandas data manipulation packages
Author

Christoph Scheuch

Published

January 7, 2024

There are a myriad of options to perform essential data manipulation tasks in R and Python (see, for instance, my other posts on dplyr vs ibis and dplyr vs polars). However, if we want to do tidy data science in R, there is a clear forerunner: dplyr. In the world of Python, pandas is the most popular data analysis library. In this blog post, I illustrate their syntactic similarities and highlight differences between these two packages that emerge for a few key tasks.

Before we dive into the comparison, a short introduction to the packages: the dplyr package in R allows users to refer to columns without quotation marks due to its implementation of non-standard evaluation (NSE). NSE is a programming technique used in R that allows functions to capture the expressions passed to them as arguments, rather than just the values of those arguments. The primary goal of NSE in the context of dplyr is to create a more user-friendly and intuitive syntax. This makes data manipulation tasks more straightforward and aligns with the general philosophy of the tidyverse to make data science faster, easier, and more fun.1

pandas is also designed for data analysis and provides a comprehensive range of functionalities for data manipulation and it is designed to efficiently handle in-memory data. The package has a large community, given Python’s popularity in various fields. The learning curve might be steeper for beginners due to Python’s general-purpose nature and the verbosity of pandas syntax, but it integrates well with web apps, machine learning models, etc.

pandas syntax

If you are a seasoned pandas user, you might find my approach non-Pythonic for at least two deliberate choices that I made: (i) I prefer to have chainable methods over using verbose expressions over multiple lines; and (ii) I prefer to have whitespace around all equal signs, not just for variable assignments. In my view, these two choices improve the compliance of pandas with tidy coding principles by increasing the readability of code.

Loading packages and data

We start by loading the main packages of interest and the popular palmerpenguins package that exists for both R and Python. We then use the penguins data frame as the data to compare all functions and methods below.

library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins
import pandas as pd
from palmerpenguins import load_penguins

penguins = load_penguins()

Work with rows

Filter rows

Filtering rows with dplyr is based on NSE and the dplyr::filter() function. To replicate the same results with pandas, you can use pandas.query() method which accepts a string with the filter conditions as input.

penguins |> 
  filter(species == "Adelie" & 
           island %in% c("Biscoe", "Dream"))
# A tibble: 100 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           37.8          18.3               174        3400
 2 Adelie  Biscoe           37.7          18.7               180        3600
 3 Adelie  Biscoe           35.9          19.2               189        3800
 4 Adelie  Biscoe           38.2          18.1               185        3950
 5 Adelie  Biscoe           38.8          17.2               180        3800
 6 Adelie  Biscoe           35.3          18.9               187        3800
 7 Adelie  Biscoe           40.6          18.6               183        3550
 8 Adelie  Biscoe           40.5          17.9               187        3200
 9 Adelie  Biscoe           37.9          18.6               172        3150
10 Adelie  Biscoe           40.5          18.9               180        3950
# ℹ 90 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .query("species == 'Adelie' and island in ['Biscoe', 'Dream']")
)
    species  island  bill_length_mm  ...  body_mass_g     sex  year
20   Adelie  Biscoe            37.8  ...       3400.0  female  2007
21   Adelie  Biscoe            37.7  ...       3600.0    male  2007
22   Adelie  Biscoe            35.9  ...       3800.0  female  2007
23   Adelie  Biscoe            38.2  ...       3950.0    male  2007
24   Adelie  Biscoe            38.8  ...       3800.0    male  2007
..      ...     ...             ...  ...          ...     ...   ...
147  Adelie   Dream            36.6  ...       3475.0  female  2009
148  Adelie   Dream            36.0  ...       3450.0  female  2009
149  Adelie   Dream            37.8  ...       3750.0    male  2009
150  Adelie   Dream            36.0  ...       3700.0  female  2009
151  Adelie   Dream            41.5  ...       4000.0    male  2009

[100 rows x 8 columns]

Slice rows

dplyr::slice() takes integers with row numbers as inputs, so you can use ranges and arbitrary vectors of integers. pandas.iloc[] also provides a function for integer-location based indexing (note that indexing starts at 0 in Python, while it starts at 1 in R). Note that pandas.iloc[] requires square brackets instead of parentheses.

penguins |> 
  slice(10:20)
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           42            20.2               190        4250
 2 Adelie  Torgersen           37.8          17.1               186        3300
 3 Adelie  Torgersen           37.8          17.3               180        3700
 4 Adelie  Torgersen           41.1          17.6               182        3200
 5 Adelie  Torgersen           38.6          21.2               191        3800
 6 Adelie  Torgersen           34.6          21.1               198        4400
 7 Adelie  Torgersen           36.6          17.8               185        3700
 8 Adelie  Torgersen           38.7          19                 195        3450
 9 Adelie  Torgersen           42.5          20.7               197        4500
10 Adelie  Torgersen           34.4          18.4               184        3325
11 Adelie  Torgersen           46            21.5               194        4200
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .iloc[9:20]
)
   species     island  bill_length_mm  ...  body_mass_g     sex  year
9   Adelie  Torgersen            42.0  ...       4250.0     NaN  2007
10  Adelie  Torgersen            37.8  ...       3300.0     NaN  2007
11  Adelie  Torgersen            37.8  ...       3700.0     NaN  2007
12  Adelie  Torgersen            41.1  ...       3200.0  female  2007
13  Adelie  Torgersen            38.6  ...       3800.0    male  2007
14  Adelie  Torgersen            34.6  ...       4400.0    male  2007
15  Adelie  Torgersen            36.6  ...       3700.0  female  2007
16  Adelie  Torgersen            38.7  ...       3450.0  female  2007
17  Adelie  Torgersen            42.5  ...       4500.0    male  2007
18  Adelie  Torgersen            34.4  ...       3325.0  female  2007
19  Adelie  Torgersen            46.0  ...       4200.0    male  2007

[11 rows x 8 columns]

Arrange rows

To orders the rows of a data frame by the values of selected columns, we have dplyr::arrange() and pandas.sort_values(). Note that both approaches arrange rows in an an ascending order and puts missing values last as defaults.

penguins |> 
  arrange(island, desc(bill_length_mm))
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           59.6          17                 230        6050
 2 Gentoo  Biscoe           55.9          17                 228        5600
 3 Gentoo  Biscoe           55.1          16                 230        5850
 4 Gentoo  Biscoe           54.3          15.7               231        5650
 5 Gentoo  Biscoe           53.4          15.8               219        5500
 6 Gentoo  Biscoe           52.5          15.6               221        5450
 7 Gentoo  Biscoe           52.2          17.1               228        5400
 8 Gentoo  Biscoe           52.1          17                 230        5550
 9 Gentoo  Biscoe           51.5          16.3               230        5500
10 Gentoo  Biscoe           51.3          14.2               218        5300
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .sort_values(by = ["island", "bill_length_mm"], 
               ascending = [True, False])
)
    species     island  bill_length_mm  ...  body_mass_g     sex  year
185  Gentoo     Biscoe            59.6  ...       6050.0    male  2007
253  Gentoo     Biscoe            55.9  ...       5600.0    male  2009
267  Gentoo     Biscoe            55.1  ...       5850.0    male  2009
215  Gentoo     Biscoe            54.3  ...       5650.0    male  2008
259  Gentoo     Biscoe            53.4  ...       5500.0    male  2009
..      ...        ...             ...  ...          ...     ...   ...
80   Adelie  Torgersen            34.6  ...       3200.0  female  2008
18   Adelie  Torgersen            34.4  ...       3325.0  female  2007
8    Adelie  Torgersen            34.1  ...       3475.0     NaN  2007
70   Adelie  Torgersen            33.5  ...       3600.0  female  2008
3    Adelie  Torgersen             NaN  ...          NaN     NaN  2007

[344 rows x 8 columns]

Work with columns

Select columns

Selecting a subset of columns works very similarly withdplyr::select() and pandas.get(). The former accepts column names using NSE (or vectors of charaters), while the latter requires a vector of strings with column names as inputs.

penguins |> 
  select(bill_length_mm, sex)
# A tibble: 344 × 2
   bill_length_mm sex   
            <dbl> <fct> 
 1           39.1 male  
 2           39.5 female
 3           40.3 female
 4           NA   <NA>  
 5           36.7 female
 6           39.3 male  
 7           38.9 female
 8           39.2 male  
 9           34.1 <NA>  
10           42   <NA>  
# ℹ 334 more rows
(penguins
  .get(["bill_length_mm", "sex"])
)
     bill_length_mm     sex
0              39.1    male
1              39.5  female
2              40.3  female
3               NaN     NaN
4              36.7  female
..              ...     ...
339            55.8    male
340            43.5  female
341            49.6    male
342            50.8    male
343            50.2  female

[344 rows x 2 columns]

Renaming columns

Renaming columns also works very similarly with the major difference that pandas.rename() takes a dictionary with mappings of old to new names as input, while dplyr::rename() takes variable names via the usual NSE.

penguins |> 
  rename(bill_length = bill_length_mm,
         bill_depth = bill_depth_mm)
# A tibble: 344 × 8
   species island    bill_length bill_depth flipper_length_mm body_mass_g sex   
   <fct>   <fct>           <dbl>      <dbl>             <int>       <int> <fct> 
 1 Adelie  Torgersen        39.1       18.7               181        3750 male  
 2 Adelie  Torgersen        39.5       17.4               186        3800 female
 3 Adelie  Torgersen        40.3       18                 195        3250 female
 4 Adelie  Torgersen        NA         NA                  NA          NA <NA>  
 5 Adelie  Torgersen        36.7       19.3               193        3450 female
 6 Adelie  Torgersen        39.3       20.6               190        3650 male  
 7 Adelie  Torgersen        38.9       17.8               181        3625 female
 8 Adelie  Torgersen        39.2       19.6               195        4675 male  
 9 Adelie  Torgersen        34.1       18.1               193        3475 <NA>  
10 Adelie  Torgersen        42         20.2               190        4250 <NA>  
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
(penguins
  .rename(columns = {"bill_length_mm": "bill_length",
                     "bill_depth_mm" : "bill_depth"})
)
       species     island  bill_length  ...  body_mass_g     sex  year
0       Adelie  Torgersen         39.1  ...       3750.0    male  2007
1       Adelie  Torgersen         39.5  ...       3800.0  female  2007
2       Adelie  Torgersen         40.3  ...       3250.0  female  2007
3       Adelie  Torgersen          NaN  ...          NaN     NaN  2007
4       Adelie  Torgersen         36.7  ...       3450.0  female  2007
..         ...        ...          ...  ...          ...     ...   ...
339  Chinstrap      Dream         55.8  ...       4000.0    male  2009
340  Chinstrap      Dream         43.5  ...       3400.0  female  2009
341  Chinstrap      Dream         49.6  ...       3775.0    male  2009
342  Chinstrap      Dream         50.8  ...       4100.0    male  2009
343  Chinstrap      Dream         50.2  ...       3775.0  female  2009

[344 rows x 8 columns]

Mutate columns

Transforming existing columns or creating new ones is an essential part of data analysis. dplyr::mutate() and pandas.assign() are the work horses for these tasks. While dplyr starts with column names before the expressions that transform columns, pandas uses the lambda function to assign expressions to new columns. Note that you have to split up variable assignments if you want to refer to a newly created variable in pandas, while you can refer to the new variables in the same mutate block in dplyr.

penguins |> 
  mutate(ones = 1,
         bill_length = bill_length_mm / 10,
         bill_length_squared = bill_length^2) |> 
  select(ones, bill_length_mm, bill_length, bill_length_squared)
# A tibble: 344 × 4
    ones bill_length_mm bill_length bill_length_squared
   <dbl>          <dbl>       <dbl>               <dbl>
 1     1           39.1        3.91                15.3
 2     1           39.5        3.95                15.6
 3     1           40.3        4.03                16.2
 4     1           NA         NA                   NA  
 5     1           36.7        3.67                13.5
 6     1           39.3        3.93                15.4
 7     1           38.9        3.89                15.1
 8     1           39.2        3.92                15.4
 9     1           34.1        3.41                11.6
10     1           42          4.2                 17.6
# ℹ 334 more rows
(penguins 
  .assign(ones = 1,
          bill_length = lambda x: x["bill_length_mm"] / 10)
  .assign(bill_length_squared = lambda x: x["bill_length"] ** 2)
  .get(["ones", "bill_length_mm", "bill_length", "bill_length_squared"])
)
     ones  bill_length_mm  bill_length  bill_length_squared
0       1            39.1         3.91              15.2881
1       1            39.5         3.95              15.6025
2       1            40.3         4.03              16.2409
3       1             NaN          NaN                  NaN
4       1            36.7         3.67              13.4689
..    ...             ...          ...                  ...
339     1            55.8         5.58              31.1364
340     1            43.5         4.35              18.9225
341     1            49.6         4.96              24.6016
342     1            50.8         5.08              25.8064
343     1            50.2         5.02              25.2004

[344 rows x 4 columns]

Relocate columns

dplyr::relocate() provides options to change the positions of columns in a data frame, using the same syntax as dplyr::select(). In addition, there are the options .after and .before to provide users with additional shortcuts.

The recommended way to relocate columns in pandas is to use the pandas.get() method, but there are no options as in dplyr::relocate(). In fact, the safest way to consistently get the correct order of columns is to explicitly specify them.

penguins |> 
  relocate(c(species, bill_length_mm), .before = sex)
# A tibble: 344 × 8
   island    bill_depth_mm flipper_length_mm body_mass_g species bill_length_mm
   <fct>             <dbl>             <int>       <int> <fct>            <dbl>
 1 Torgersen          18.7               181        3750 Adelie            39.1
 2 Torgersen          17.4               186        3800 Adelie            39.5
 3 Torgersen          18                 195        3250 Adelie            40.3
 4 Torgersen          NA                  NA          NA Adelie            NA  
 5 Torgersen          19.3               193        3450 Adelie            36.7
 6 Torgersen          20.6               190        3650 Adelie            39.3
 7 Torgersen          17.8               181        3625 Adelie            38.9
 8 Torgersen          19.6               195        4675 Adelie            39.2
 9 Torgersen          18.1               193        3475 Adelie            34.1
10 Torgersen          20.2               190        4250 Adelie            42  
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
(penguins
  .get(["island", "bill_depth_mm", "flipper_length_mm", "body_mass_g",
        "species", "bill_length_mm", "sex"])
)
        island  bill_depth_mm  ...  bill_length_mm     sex
0    Torgersen           18.7  ...            39.1    male
1    Torgersen           17.4  ...            39.5  female
2    Torgersen           18.0  ...            40.3  female
3    Torgersen            NaN  ...             NaN     NaN
4    Torgersen           19.3  ...            36.7  female
..         ...            ...  ...             ...     ...
339      Dream           19.8  ...            55.8    male
340      Dream           18.1  ...            43.5  female
341      Dream           18.2  ...            49.6    male
342      Dream           19.0  ...            50.8    male
343      Dream           18.7  ...            50.2  female

[344 rows x 7 columns]

Work with groups of rows

Simple summaries by group

Let’s suppose we want to compute summaries by groups such as means or medians. Both packages are very similar again: on the R side you have dplyr::group_by() and dplyr::summarize(), while on the Python side you have pandas.groupby() and pandas.agg().

Note that dplyr::groupby() also automatically arranges the results by the group, so the reproduce the results of dplyr, we need to add pandas.sort() to the chain.

penguins |> 
  group_by(island) |> 
  summarize(bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
  island    bill_depth_mean
  <fct>               <dbl>
1 Biscoe               15.9
2 Dream                18.3
3 Torgersen            18.4
(penguins
  .groupby("island")
  .agg(bill_depth_mean = ("bill_depth_mm", "mean"))
)
           bill_depth_mean
island                    
Biscoe           15.874850
Dream            18.344355
Torgersen        18.429412

More complicated summaries by group

Typically, you want to create multiple different summaries by groups. dplyr provides a lot of flexibility to create new variables on the fly, as does pandas. For instance, we can pass expressions to them mean functions in order to create the share of female penguins per island in the summary statement. Note that you again have to use lambda functions in pandas.

penguins |> 
  group_by(island) |> 
  summarize(count = n(),
            bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE),
            flipper_length_median = median(flipper_length_mm, na.rm = TRUE),
            body_mass_sd = sd(body_mass_g, na.rm = TRUE),
            share_female = mean(sex == "female", na.rm = TRUE))
# A tibble: 3 × 6
  island   count bill_depth_mean flipper_length_median body_mass_sd share_female
  <fct>    <int>           <dbl>                 <dbl>        <dbl>        <dbl>
1 Biscoe     168            15.9                   214         783.        0.491
2 Dream      124            18.3                   193         417.        0.496
3 Torgers…    52            18.4                   191         445.        0.511
(penguins
  .groupby("island")
  .agg(count = ("island", "size"),
       bill_depth_mean = ("bill_depth_mm", "mean"),
       flipper_length_median = ("flipper_length_mm", "median"),
       body_mass_sd = ("body_mass_g", "std"),
       share_female = ("sex", lambda x: (x == "female").mean))
)
           count  ...                                       share_female
island            ...                                                   
Biscoe       168  ...  <bound method Series.mean of 20      True\n21 ...
Dream        124  ...  <bound method Series.mean of 30      True\n31 ...
Torgersen     52  ...  <bound method Series.mean of 0      False\n1  ...

[3 rows x 5 columns]

Conclusion

This post highlights syntactic similarities and differences across R’s dplyr and Python’s pandas packages. Two key points emerge: (i) dplyr heavily relies on NSE to enable a syntax that refrains from using strings, something that is not possible in Python; (ii) the structure of inputs to pandas methods is inconsistent compared to dplyr (sometimes inputs are vectors of strings, sometimes just a single string, sometimes dictionaries, etc.). I want to close this post by emphasizing that both languages and packages have their own merits and supporters. I personally find it hard to remember the syntax of each pandas method, so I’m much more prone to on-the-fly coding errors than compared to dplyr.

Footnotes

  1. See the unifying principles of the tidyverse: https://design.tidyverse.org/unifying.html.↩︎