Tidy Data: Tabular Data Storage Comparison

Categories: R, Python, Data

A comparison of popular open-source data storage technologies using R and Python

Author: Christoph Scheuch
Published: January 20, 2024

Sharing data between different collaborators, machines, or programming languages can be cumbersome for many reasons. In this post, I look into the issue of column types and how different storage technologies handle them. I focus on self-contained technologies that are easy to install and run on your machine without setting up a separate backend server. This requirement typically arises in academic contexts, educational settings, or when you quickly want to prototype something without spending time on setting up a data backend.

I start with simple CSV files, then move on to the popular SQLite database, before I look at the rising star DuckDB. I close the comparison with a look at the Parquet and Feather file formats. In each case, I check how the column types depend on the language that was used to write the data to the corresponding storage technology.

Create example data

Let us start this blog post by creating an example data frame with the most important column types that one typically encounters in data analysis. The code chunks below create the example data. Note that I create a little helper function in R to quickly extract the column types of a data frame; in pandas, the .dtypes attribute achieves the same.

library(dplyr)

data_r <- tibble(
  character_column = c("A", "B", "C", "D"), 
  date_column = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01")),
  datetime_column = as.POSIXct(c("2023-01-01 10:00:00", "2023-02-01 11:00:00", "2023-03-01 12:00:00", "2023-04-01 13:00:00")),
  numeric_column = c(1.5, 2.5, 3.5, 4.5),
  integer_column = as.integer(c(1, 2, 3, 4)),
  logical_column = c(TRUE, FALSE, FALSE, TRUE)
)

extract_column_classes <- function(df) {
  sapply(sapply(df, class), function(x) paste(x, collapse = ", "))
}

tibble("data_r" = extract_column_classes(data_r))
# A tibble: 6 × 1
  data_r         
  <chr>          
1 character      
2 Date           
3 POSIXct, POSIXt
4 numeric        
5 integer        
6 logical        
import pandas as pd

data_python = pd.DataFrame({
  "character_column": ["A", "B", "C", "D"],
  "date_column": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01"]),
  "datetime_column": pd.to_datetime(["2023-01-01 10:00:00", "2023-02-01 11:00:00", "2023-03-01 12:00:00", "2023-04-01 13:00:00"]),
  "numeric_column": [1.5, 2.5, 3.5, 4.5],
  "integer_column": [1, 2, 3, 4],
  "logical_column": [True, False, False, True]
})

data_python.dtypes
character_column            object
date_column         datetime64[ns]
datetime_column     datetime64[ns]
numeric_column             float64
integer_column               int64
logical_column                bool
dtype: object

The following table shows the mapping between the different column types in the example above.

R          Python
character  object
date       datetime
datetime   datetime
numeric    float
integer    integer
logical    bool

A few differences immediately arise:

  • pandas uses the object type to denote character columns.1
  • pandas does not have a date type and rather reuses datetime with a different format.

The other column types might have different names, but they are essentially equivalent.
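
To make the first two points concrete, here is a small Python sketch illustrating the opt-in StringDtype mentioned in footnote 1 and the fact that parsed dates are just datetimes at midnight:

import pandas as pd

# By default, pandas stores strings in generic "object" columns;
# the dedicated StringDtype (see footnote 1) is opt-in.
s_object = pd.Series(["A", "B", "C", "D"])
s_string = pd.Series(["A", "B", "C", "D"], dtype = "string")

# There is no separate date dtype: parsed dates are datetime64 values at midnight.
dates = pd.to_datetime(["2023-01-01", "2023-02-01"])

print(s_object.dtype, s_string.dtype, dates.dtype)  # object string datetime64[ns]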

CSV

CSV (Comma-Separated Values) files are a popular data format, renowned for their simplicity, widespread support, and ease of use. Their simple structure, consisting of plain text with values separated by commas, makes them an accessible choice for data storage and exchange.

However, CSV files have their limitations. They are not the most efficient format for handling large datasets, as they can become cumbersome and slow to process. Additionally, CSVs are limited in their ability to handle complex data structures or to store metadata. They lack capabilities for data typing, hierarchy, and relationships, which are essential for more complex data management needs. This limitation makes them less suitable for applications requiring advanced data storage and retrieval functionalities.

I use the readr package to write and read CSV files in this post. The nice thing is that readr recovers most column types through its powerful built-in parser (only the integer column comes back as a double).

library(readr)

write_csv(data_r, file = "data_r.csv")
data_r_csv <- read_csv("data_r.csv")

glimpse(data_r_csv)
Rows: 4
Columns: 6
$ character_column <chr> "A", "B", "C", "D"
$ date_column      <date> 2023-01-01, 2023-02-01, 2023-03-01, 2023-04-01
$ datetime_column  <dttm> 2023-01-01 09:00:00, 2023-02-01 10:00:00, 2023-03-01 …
$ numeric_column   <dbl> 1.5, 2.5, 3.5, 4.5
$ integer_column   <dbl> 1, 2, 3, 4
$ logical_column   <lgl> TRUE, FALSE, FALSE, TRUE

Thanks to the smart parsing capabilities of read_csv(), the column types are even aligned between R and Python when using the CSV file format.

data_python_csv <- read_csv("data_python.csv")

tibble(
  "data_r_csv" = extract_column_classes(data_r_csv),
  "data_python_csv" = extract_column_classes(data_python_csv)
)
# A tibble: 6 × 2
  data_r_csv      data_python_csv
  <chr>           <chr>          
1 character       character      
2 Date            Date           
3 POSIXct, POSIXt POSIXct, POSIXt
4 numeric         numeric        
5 numeric         numeric        
6 logical         logical        

To write and read CSV files, I use the corresponding pandas methods. To parse dates correctly when reading the data back in, you actually need to add parse_dates = ["date_column", "datetime_column"] to pandas.read_csv(). If you don't specify this, pandas simply reads the date columns as strings.

data_python.to_csv("data_python.csv", index = False)
data_python_csv = pd.read_csv("data_python.csv")

data_python_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   character_column  4 non-null      object 
 1   date_column       4 non-null      object 
 2   datetime_column   4 non-null      object 
 3   numeric_column    4 non-null      float64
 4   integer_column    4 non-null      int64  
 5   logical_column    4 non-null      bool   
dtypes: bool(1), float64(1), int64(1), object(3)
memory usage: 292.0+ bytes

The missing date parsing specification actually leads to an alignment with the data that was written from R, because dates are just handled as strings in both cases.

data_r_csv = pd.read_csv("data_r.csv")

pd.DataFrame({
  "data_r_csv": data_r_csv.dtypes, 
  "data_python_csv": data_python_csv.dtypes
})
                 data_r_csv data_python_csv
character_column     object          object
date_column          object          object
datetime_column      object          object
numeric_column      float64         float64
integer_column        int64           int64
logical_column         bool            bool

If you are willing to explicitly specify the date columns in Python, then CSV is actually a valid option to transfer data between R and Python. We could even open CSV files in Excel and manipulate them there, so the simplicity is quite striking.
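
For example, here is a minimal sketch of such an explicit specification when reading the R-written CSV with pandas (using the column names from the example data above):

data_r_csv_parsed = pd.read_csv(
  "data_r.csv",
  parse_dates = ["date_column", "datetime_column"]
)
data_r_csv_parsed.dtypes  # date_column and datetime_column now show datetime64[ns]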

If you work with large amounts of data or if you want to parallelize read or write operations on the same file, then CSV is definitely not for you. On the one hand, it is hard to read only parts of a CSV file (e.g., only rows that fulfill certain criteria). On the other hand, you will most likely run into all kinds of concurrency issues due to file locking.

With these issues in mind, let us move on to more advanced storage technologies.

SQLite

SQLite is a highly regarded, lightweight, file-based SQL database that stands out for its simplicity and ease of use. As a self-contained, serverless database engine, it provides a robust platform for storing and managing data without the need for a separate server setup. This feature makes SQLite exceptionally suitable for applications that require an embedded database system or for development purposes.

However, SQLite has its limitations. It is not typically recommended for very large-scale applications or those requiring high levels of concurrency. SQLite handles concurrency at a basic level but is not optimized for situations where numerous processes need to write to the database simultaneously. For such applications, more robust database systems like PostgreSQL or MySQL are generally preferred. Nonetheless, for smaller applications, educational purposes, and situations where a lightweight and easy-to-deploy database is needed, SQLite is an excellent choice.

The RSQLite package provides a seamless integration of SQLite databases into R projects. The typical workflow starts with initializing a connection to a database (which also creates the SQLite database if it does not exist yet). To check how SQLite treats our column types, I just write the data to the local database and then read it back into a different data frame. Note that it is also good practice to close database connections once you are done with them (in particular when multiple processes target the same database).

We can see that the column types are not automatically preserved: date and datetime columns come back as numbers and logical columns as integers.2 Note that you can use the extended_types = TRUE option in dbConnect() if you want your SQLite database to handle dates correctly in R (other programming languages ignore this option).

library(RSQLite)

con_sqlite_r <- dbConnect(SQLite(), "data_r.sqlite")
dbWriteTable(con_sqlite_r, "data", data_r, overwrite = TRUE)
data_r_sqlite <- dbReadTable(con_sqlite_r, "data")
dbDisconnect(con_sqlite_r)

glimpse(data_r_sqlite)
Rows: 4
Columns: 6
$ character_column <chr> "A", "B", "C", "D"
$ date_column      <dbl> 19358, 19389, 19417, 19448
$ datetime_column  <dbl> 1672563600, 1675245600, 1677668400, 1680346800
$ numeric_column   <dbl> 1.5, 2.5, 3.5, 4.5
$ integer_column   <int> 1, 2, 3, 4
$ logical_column   <int> 1, 0, 0, 1

If we now read from the SQLite database created in Python, we can see that things get even more complicated: pandas stores dates as characters, so you need to convert them to dates or datetimes yourself.

con_sqlite_python <- dbConnect(SQLite(), "data_python.sqlite")
data_python_sqlite <- dbReadTable(con_sqlite_python, "data")
dbDisconnect(con_sqlite_python)

tibble(
  "data_r_sqlite" = extract_column_classes(data_r_sqlite),
  "data_python_sqlite" = extract_column_classes(data_python_sqlite)
)
# A tibble: 6 × 2
  data_r_sqlite data_python_sqlite
  <chr>         <chr>             
1 character     character         
2 numeric       character         
3 numeric       character         
4 numeric       numeric           
5 integer       integer           
6 integer       integer           

I use the sqlite3 package to perform the same tasks as in R: create a database (or connect to an existing one), write the data to the database, and read the data back into a data frame.

Similar to CSV, you have to add the parse_dates = ["date_column", "datetime_column"] option to correctly read dates back in using pandas.read_sql_query(). Otherwise, dates are simply string columns.

import sqlite3

con_sqlite_python = sqlite3.connect("data_python.sqlite")
res = data_python.to_sql("data", con_sqlite_python, if_exists = "replace", index = False)
data_python_sqlite = pd.read_sql_query("SELECT * FROM data", con_sqlite_python)
con_sqlite_python.close()

data_python_sqlite.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   character_column  4 non-null      object 
 1   date_column       4 non-null      object 
 2   datetime_column   4 non-null      object 
 3   numeric_column    4 non-null      float64
 4   integer_column    4 non-null      int64  
 5   logical_column    4 non-null      int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 320.0+ bytes
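
If we instead pass the parse_dates option mentioned above, the date columns come back as proper datetimes. A minimal sketch that re-reads the table from the database written above:

con_sqlite_python = sqlite3.connect("data_python.sqlite")
data_python_sqlite_parsed = pd.read_sql_query(
  "SELECT * FROM data", con_sqlite_python,
  parse_dates = ["date_column", "datetime_column"]
)
con_sqlite_python.close()
data_python_sqlite_parsed.dtypes  # date_column and datetime_column now show datetime64[ns]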

When I compare the SQLite data to the database that I created in R, we can see that dates are handled differently: R stores dates and datetimes as numbers, which Python interprets as float64.

con_sqlite_r = sqlite3.connect("data_r.sqlite")
data_r_sqlite = pd.read_sql_query("SELECT * FROM data", con_sqlite_r)
con_sqlite_r.close()

pd.DataFrame({
  "data_r_sqlite": data_r_sqlite.dtypes, 
  "data_python_sqlite": data_python_sqlite.dtypes
})
                 data_r_sqlite data_python_sqlite
character_column        object             object
date_column            float64             object
datetime_column        float64             object
numeric_column         float64            float64
integer_column           int64              int64
logical_column           int64              int64
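
Since R encodes Date as days and POSIXct as seconds since the Unix epoch (see footnote 2), one could also convert the numeric columns from the R-written database back into datetimes on the Python side. A minimal sketch:

# Interpret R's epoch-based encoding: days for Date, seconds for POSIXct.
data_r_sqlite["date_column"] = pd.to_datetime(data_r_sqlite["date_column"], unit = "D")
data_r_sqlite["datetime_column"] = pd.to_datetime(data_r_sqlite["datetime_column"], unit = "s")
data_r_sqlite.dtypes  # both columns now show datetime64[ns]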

If you stay in the same programming language and only work with up to a couple of gigabytes of data, then SQLite is a great database. However, if you want to pull data from the same database in different programming languages, then you have to be really careful with respect to column types.

DuckDB

DuckDB is an emerging database management system that is gaining attention for its remarkable efficiency and user-friendliness. It is specifically designed to be an OLAP (Online Analytical Processing) database, making it well-suited for executing analytical queries on large datasets. This focus positions DuckDB as a highly efficient tool for data analytics. One of the standout features of DuckDB is its ability to run directly within data analysis environments. It integrates seamlessly with popular data science languages and tools, allowing data scientists to perform analysis within their familiar programming environment.

However, being a relatively new addition to the world of database management systems, DuckDB might not yet offer the same level of community support, breadth of tools, and integrations as more established databases (such as SQLite).

The duckdb package contains everything we need to set up the database and execute the same steps as with SQLite. If I write the data to the database and then read it back in, I get the same column types again - this is great!

library(duckdb)

con_duckdb_r <- dbConnect(duckdb(), "data_r.duckdb")
dbWriteTable(con_duckdb_r, "data", data_r, overwrite = TRUE)
data_r_duckdb <- dbReadTable(con_duckdb_r, "data")
dbDisconnect(con_duckdb_r, shutdown = TRUE)

glimpse(data_r_duckdb)
Rows: 4
Columns: 6
$ character_column <chr> "A", "B", "C", "D"
$ date_column      <date> 2023-01-01, 2023-02-01, 2023-03-01, 2023-04-01
$ datetime_column  <dttm> 2023-01-01 09:00:00, 2023-02-01 10:00:00, 2023-03-01 …
$ numeric_column   <dbl> 1.5, 2.5, 3.5, 4.5
$ integer_column   <int> 1, 2, 3, 4
$ logical_column   <lgl> TRUE, FALSE, FALSE, TRUE

If I read data from the database created in Python, we see only a small difference: while we have Date in R, we get a POSIXct column from the Python database. For many applications, this difference should have no impact. However, if you work with time series data, make sure to have the correct date format before you proceed!

con_duckdb_python <- dbConnect(duckdb(), "data_python.duckdb")
data_python_duckdb <- dbReadTable(con_duckdb_python, "data")
dbDisconnect(con_duckdb_python, shutdown = TRUE)

tibble(
  "data_r_duckdb" = extract_column_classes(data_r_duckdb),
  "data_python_duckdb" = extract_column_classes(data_python_duckdb)
)
# A tibble: 6 × 2
  data_r_duckdb   data_python_duckdb
  <chr>           <chr>             
1 character       character         
2 Date            POSIXct, POSIXt   
3 POSIXct, POSIXt POSIXct, POSIXt   
4 numeric         numeric           
5 integer         integer           
6 logical         integer           

I use the duckdb package to create the database and the sqlalchemy package to handle the connection. Note that I don't have to disconnect manually because the engine takes care of this automatically. Again, the great thing is that we get the same column types for the data that I put into the database and the data that I pull back out of it!

import duckdb
from sqlalchemy import create_engine

con_duckdb_python = create_engine("duckdb:///data_python.duckdb")
res = data_python.to_sql("data", con_duckdb_python, if_exists = "replace", index = False)
data_python_duckdb = pd.read_sql_query("SELECT * FROM data", con_duckdb_python)

data_python_duckdb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   character_column  4 non-null      object        
 1   date_column       4 non-null      datetime64[ns]
 2   datetime_column   4 non-null      datetime64[ns]
 3   numeric_column    4 non-null      float64       
 4   integer_column    4 non-null      int64         
 5   logical_column    4 non-null      bool          
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes

If I now read the data from the database generated in R, then we see that the only difference again arises in the date column, but at least the datetime columns are aligned.

con_duckdb_r = create_engine("duckdb:///data_r.duckdb")
data_r_duckdb = pd.read_sql_query("SELECT * FROM data", con_duckdb_r)

pd.DataFrame({
  "data_r_duckdb": data_r_duckdb.dtypes, 
  "data_python_duckdb": data_python_duckdb.dtypes
})
                   data_r_duckdb data_python_duckdb
character_column          object             object
date_column               object     datetime64[ns]
datetime_column   datetime64[ns]     datetime64[ns]
numeric_column           float64            float64
integer_column             int64              int64
logical_column              bool               bool

DuckDB is a great alternative if you want to play around with new data storage technologies and if you work with multiple programming languages. The examples above show a high overlap between column types, with the only major issue being that R's Date type is read as a string in Python.
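
If the date column matters downstream, one way to handle this is to coerce it on the Python side after reading. A minimal sketch:

data_r_duckdb["date_column"] = pd.to_datetime(data_r_duckdb["date_column"])
data_r_duckdb.dtypes  # date_column now shows datetime64[ns]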

Parquet

Parquet is a highly efficient columnar storage format, increasingly popular in the field of data analytics, particularly when dealing with large datasets. This format is structured in a way that stores data by columns rather than by rows, which is typical in traditional database formats. This columnar approach allows for more efficient data retrieval and scanning, making it an excellent choice for analytical querying where operations are often performed on specific columns of data (similar to DuckDB).

One of the key strengths of Parquet is its ability to compress data effectively. It supports various compression and encoding schemes, which can significantly reduce the storage footprint of large data sets without sacrificing read performance. This efficiency in data storage and compression makes Parquet an attractive option for systems where storage cost and I/O performance are critical considerations, such as in big data applications and cloud environments.

The arrow package provides all we need to write and read Parquet files. Similar to DuckDB, all column types are preserved exactly as they were before we wrote the data frame to a Parquet file.

library(arrow)

write_parquet(data_r, "data_r.parquet")
data_r_parquet <- read_parquet("data_r.parquet")

glimpse(data_r_parquet)
Rows: 4
Columns: 6
$ character_column <chr> "A", "B", "C", "D"
$ date_column      <date> 2023-01-01, 2023-02-01, 2023-03-01, 2023-04-01
$ datetime_column  <dttm> 2023-01-01 10:00:00, 2023-02-01 11:00:00, 2023-03-01 …
$ numeric_column   <dbl> 1.5, 2.5, 3.5, 4.5
$ integer_column   <int> 1, 2, 3, 4
$ logical_column   <lgl> TRUE, FALSE, FALSE, TRUE

If I now read the Parquet file created in Python, we see that the only difference again arises in the date column (just as with DuckDB).

data_python_parquet <- read_parquet("data_python.parquet")

tibble(
  "data_r_parquet" = extract_column_classes(data_r_parquet),
  "data_python_parquet" = extract_column_classes(data_python_parquet)
)
# A tibble: 6 × 2
  data_r_parquet  data_python_parquet
  <chr>           <chr>              
1 character       character          
2 Date            POSIXct, POSIXt    
3 POSIXct, POSIXt POSIXct, POSIXt    
4 numeric         numeric            
5 integer         integer            
6 logical         logical            

I use pandas (which relies on pyarrow under the hood) to perform the same write and read operations as in R. As expected, all column types are preserved.

import pyarrow.parquet as pq

data_python.to_parquet("data_python.parquet")
data_python_parquet = pd.read_parquet("data_python.parquet")

data_python_parquet.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   character_column  4 non-null      object        
 1   date_column       4 non-null      datetime64[ns]
 2   datetime_column   4 non-null      datetime64[ns]
 3   numeric_column    4 non-null      float64       
 4   integer_column    4 non-null      int64         
 5   logical_column    4 non-null      bool          
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes

If I now read the Parquet file written in R, we see that the date column again differs (just as with DuckDB). The datetime resolution (microseconds vs. nanoseconds) and the integer width (int32 vs. int64) also differ, but these differences are typically harmless.

data_r_parquet = pd.read_parquet("data_r.parquet")

pd.DataFrame({
  "data_r_parquet": data_r_parquet.dtypes, 
  "data_python_parquet": data_python_parquet.dtypes
})
                  data_r_parquet data_python_parquet
character_column          object              object
date_column               object      datetime64[ns]
datetime_column   datetime64[us]      datetime64[ns]
numeric_column           float64             float64
integer_column             int32               int64
logical_column              bool                bool
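
If exact dtype alignment matters (for instance before concatenating the two frames), one could harmonize the R-written data on the Python side. A hedged sketch, assuming pandas 2.x (which supports non-nanosecond datetime64 resolutions):

data_r_parquet_aligned = data_r_parquet.assign(
  date_column = pd.to_datetime(data_r_parquet["date_column"]),
  datetime_column = data_r_parquet["datetime_column"].astype("datetime64[ns]"),
  integer_column = data_r_parquet["integer_column"].astype("int64")
)
data_r_parquet_aligned.dtypes  # now matches data_python_parquet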

Feather

Feather is a file format primarily designed for efficient data interchange between R and Python. Developed jointly by R and Python developers, Feather provides a fast, lightweight, and easy-to-use binary format for storing data frames. However, it is primarily intended as an intermediate format for data interchange rather than for long-term data storage. It does not offer the same level of compression as formats like Parquet, nor is it optimized for queries and data analysis tasks directly on the stored files.

The arrow package also provides all we need to write and read Feather files. Similar to DuckDB and Parquet, all column types are preserved exactly as they were before we wrote the data frame to a Feather file.

write_feather(data_r, "data_r.feather")
data_r_feather <- read_feather("data_r.feather")

glimpse(data_r_feather)
Rows: 4
Columns: 6
$ character_column <chr> "A", "B", "C", "D"
$ date_column      <date> 2023-01-01, 2023-02-01, 2023-03-01, 2023-04-01
$ datetime_column  <dttm> 2023-01-01 10:00:00, 2023-02-01 11:00:00, 2023-03-01 …
$ numeric_column   <dbl> 1.5, 2.5, 3.5, 4.5
$ integer_column   <int> 1, 2, 3, 4
$ logical_column   <lgl> TRUE, FALSE, FALSE, TRUE

If I read the Feather file created in Python, the date column again comes back as POSIXct instead of Date.

data_python_feather <- read_feather("data_python.feather")

tibble(
  "data_r_feather" = extract_column_classes(data_r_feather),
  "data_python_feather" = extract_column_classes(data_python_feather)
)
# A tibble: 6 × 2
  data_r_feather  data_python_feather
  <chr>           <chr>              
1 character       character          
2 Date            POSIXct, POSIXt    
3 POSIXct, POSIXt POSIXct, POSIXt    
4 numeric         numeric            
5 integer         integer            
6 logical         logical            

I use pyarrow.feather to perform the same write and read operations as in R. As expected, all column types are preserved.

import pyarrow.feather as feather

feather.write_feather(data_python, "data_python.feather")
data_python_feather = feather.read_feather("data_python.feather")

data_python_feather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   character_column  4 non-null      object        
 1   date_column       4 non-null      datetime64[ns]
 2   datetime_column   4 non-null      datetime64[ns]
 3   numeric_column    4 non-null      float64       
 4   integer_column    4 non-null      int64         
 5   logical_column    4 non-null      bool          
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes

If I now read the Feather file written in R, we see the same pattern as with Parquet: the date column again differs, along with the datetime resolution and the integer width.

data_r_feather = feather.read_feather("data_r.feather")

pd.DataFrame({
  "data_r_feather": data_r_feather.dtypes, 
  "data_python_feather": data_python_feather.dtypes
})
                  data_r_feather data_python_feather
character_column          object              object
date_column               object      datetime64[ns]
datetime_column   datetime64[us]      datetime64[ns]
numeric_column           float64             float64
integer_column             int32               int64
logical_column              bool                bool

Conclusion

There are two main takeaways: first, R's handling of dates is peculiar and cannot be smoothly aligned with Python (at least with pandas). I need to ask myself how often I actually need the Date type if I want to avoid issues with other languages. Second, Parquet is an exciting format for storing data, and I'll definitely use it more in future applications.

Footnotes

  1. There is also a dedicated StringDtype for string data, introduced in more recent versions of pandas (version 1.0.0 and later), but it seems that it is not used often.

  2. In R, dates are typically represented as the number of days since January 1, 1970, known as the Unix epoch. This is a standard date reference used in many programming languages. The integer is positive for dates after January 1, 1970, and negative for dates before it. For example, a Date object representing January 2, 1970, would have a value of 1, as it is one day after the epoch start. Similarly, POSIXct is a count of the number of seconds since the Unix epoch (January 1, 1970, 00:00:00 GMT).