A comparison of popular open-source data storage technologies using R and Python
Author
Christoph Scheuch
Published
January 20, 2024
Sharing data between different collaborators, machines, or programming languages can be cumbersome for many reasons. In this post, I look into the issue of column types and how different storage technologies handle them. I focus on self-contained technologies that are easy to install and run on your machine without setting up a separate backend server. This requirement typically arises in academic contexts, educational settings, or when you quickly want to prototype something without spending time on setting up a data backend.
I start with simple CSV files, then move on to the popular SQLite database, before I look at the rising star DuckDB. I close the comparison with a look at the Parquet and Feather file formats. Throughout, I check how the column types depend on the language used to store the data in the corresponding storage technology.
Create example data
Let us start this blog post by creating an example data frame with the most important column types that one typically encounters in data analysis. Note that I create a little helper function that allows us to quickly extract the column types of a data frame, while we can use the .dtypes attribute in pandas to achieve the same.
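A minimal R sketch of such an example data frame and helper function might look as follows (the column names and values here are illustrative assumptions, not fixed by the rest of the post):

```r
library(tibble)

# Example data frame covering the most common column types
data_r <- tibble(
  character_column = c("A", "B", "C"),
  date_column = as.Date("2024-01-01") + 0:2,
  datetime_column = as.POSIXct("2024-01-01 12:00:00", tz = "UTC") + 0:2,
  numeric_column = c(1.5, 2.5, 3.5),
  integer_column = 1:3,
  logical_column = c(TRUE, FALSE, TRUE)
)

# Helper that returns the column types of a data frame as a character vector,
# so that two data frames can be compared side by side in a tibble
extract_column_types <- function(df) {
  vapply(df, function(x) paste(class(x), collapse = ", "), character(1))
}

extract_column_types(data_r)
```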
The following table shows the mapping between the different column types in the example above.
| R | Python |
|-----------|----------|
| character | object |
| date | datetime |
| datetime | datetime |
| numeric | float |
| integer | integer |
| logical | bool |
A few differences immediately arise:

- pandas uses the object type to denote character columns.¹
- pandas does not have a date type and rather reuses datetime with a different format.
- The other column types might have different names, but they are essentially equivalent.
CSV
CSV (Comma-Separated Values) files are a popular data format, renowned for their simplicity, widespread support, and ease of use. Their simple structure, consisting of plain text with values separated by commas, makes them an accessible choice for data storage and exchange.
However, CSV files have their limitations. They are not the most efficient format for handling large datasets, as they can become cumbersome and slow to process. Additionally, CSVs are limited in their ability to handle complex data structures or to store metadata. They lack capabilities for data typing, hierarchy, and relationships, which are essential for more complex data management needs. This limitation makes them less suitable for applications requiring advanced data storage and retrieval functionalities.
I use the readr package to write and read CSV files in this post. The nice thing is that readr actually preserves the column types through its powerful built-in parser.
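The corresponding readr calls might look like this (a sketch; file names are assumptions):

```r
library(readr)

# Write the example data to a CSV file and parse it back in
write_csv(data_r, "data_r.csv")
data_r_csv <- read_csv("data_r.csv")
```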
```
# A tibble: 6 × 2
  data_r_csv      data_python_csv
  <chr>           <chr>
1 character       character
2 Date            Date
3 POSIXct, POSIXt POSIXct, POSIXt
4 numeric         numeric
5 numeric         numeric
6 logical         logical
```
To write and read CSV files, I use the corresponding pandas methods. To parse the dates correctly when reading the data back in, you actually need to pass parse_dates = ["date_column", "datetime_column"] to pandas.read_csv(). If you don't specify this, then pandas simply reads dates as strings.
```python
data_python.to_csv("data_python.csv", index=False)

# Specify the date columns explicitly so that they are parsed correctly
data_python_csv = pd.read_csv(
    "data_python.csv", parse_dates=["date_column", "datetime_column"]
)
data_python_csv.info()
```
If you are willing to explicitly specify the date columns in Python, then CSV is actually a valid option to transfer data between R and Python. We could even open CSV files in Excel and manipulate them there, so the simplicity is quite striking.
If you work with large amounts of data or if you want to parallelize read or write operations on the same file, then CSV is definitely not for you. For one, it is hard to read only parts of a CSV file (e.g., only rows that fulfill certain criteria). For another, you will most likely run into all kinds of concurrency issues due to file locking.
With these issues in mind, let us move on to more advanced storage technologies.
SQLite
SQLite is a highly regarded, lightweight, file-based SQL database that stands out for its simplicity and ease of use. As a self-contained, serverless database engine, it provides a robust platform for storing and managing data without the need for a separate server setup. This feature makes SQLite exceptionally suitable for applications that require an embedded database system or for development purposes.
However, SQLite has its limitations. It is not typically recommended for very large-scale applications or those requiring high levels of concurrency. SQLite handles concurrency at a basic level but is not optimized for situations where numerous processes need to write to the database simultaneously. For such applications, more robust database systems like PostgreSQL or MySQL are generally preferred. Nonetheless, for smaller applications, educational purposes, and situations where a lightweight and easy-to-deploy database is needed, SQLite is an excellent choice.
The RSQLite package provides seamless integration of SQLite databases into R projects. The typical workflow starts with initializing a connection to a database (which also creates the SQLite database if it does not exist). To check how SQLite treats our column types, I just write the data to the local database and then read it back into a different data frame. Note that it is also good practice to close database connections after you are done with them (in particular when multiple processes target the same database).
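A sketch of this workflow (the table and file names are assumptions):

```r
library(RSQLite)

# Initialize a connection; this creates the SQLite database if it does not exist
con_sqlite_r <- dbConnect(SQLite(), "data_r.sqlite")

# Write the data to the local database and read it back into a different data frame
dbWriteTable(con_sqlite_r, "data", data_r, overwrite = TRUE)
data_r_sqlite <- dbGetQuery(con_sqlite_r, "SELECT * FROM data")

# Close the connection again once we are done
dbDisconnect(con_sqlite_r)
```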
We can see that the column types are not automatically preserved: date, datetime, and logical columns are converted to integers.² Note that you can use the extended_types = TRUE option in dbConnect() if you want your SQLite database to handle dates correctly in R (the other programming languages ignore this option).
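A manual conversion and the extended-types route might look like this in R (a sketch; the column names follow the example data above, and the manual conversions rely on the Unix-epoch representation described in footnote 2):

```r
# Option 1: convert the integer epoch values back manually
data_r_sqlite_fixed <- data_r_sqlite
data_r_sqlite_fixed$date_column <- as.Date(
  data_r_sqlite_fixed$date_column, origin = "1970-01-01"
)
data_r_sqlite_fixed$datetime_column <- as.POSIXct(
  data_r_sqlite_fixed$datetime_column, origin = "1970-01-01", tz = "UTC"
)
data_r_sqlite_fixed$logical_column <- as.logical(data_r_sqlite_fixed$logical_column)

# Option 2: let RSQLite map date and datetime columns automatically
con_sqlite_r_ext <- dbConnect(SQLite(), "data_r.sqlite", extended_types = TRUE)
```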
If we now read from the SQLite database created in Python, we can see that things get even more complicated: pandas stores dates as characters, so you need to convert them to dates or datetimes yourself.
```
# A tibble: 6 × 2
  data_r_sqlite data_python_sqlite
  <chr>         <chr>
1 character     character
2 numeric       character
3 numeric       character
4 numeric       numeric
5 integer       integer
6 integer       integer
```
I use the sqlite3 package to perform the same tasks as in R: create a database (or connect to an existing one), write the data to the database, and read the data back in to check the column types.
Similar to CSV, you have to add the parse_dates = ["date_column", "datetime_column"] option to pandas.read_sql_query() to correctly read back dates. Otherwise, the date columns simply come back as strings.
```python
import sqlite3

con_sqlite_python = sqlite3.connect("data_python.sqlite")
res = data_python.to_sql("data", con_sqlite_python, if_exists="replace", index=False)

# Specify the date columns explicitly so that they are parsed correctly
data_python_sqlite = pd.read_sql_query(
    "SELECT * FROM data", con_sqlite_python,
    parse_dates=["date_column", "datetime_column"]
)
con_sqlite_python.close()
data_python_sqlite.info()
```
When I compare this data to the database created in R, we can see that dates are handled differently: R stores date and datetime columns as integers, which Python in turn interprets as numeric.
If you stay in the same programming language and only work with up to a couple of gigabytes of data, then SQLite is a great database. However, if you want to pull data from the same database in different programming languages, then you have to be really careful with respect to column types.
DuckDB
DuckDB is an emerging database management system that is gaining attention for its remarkable efficiency and user-friendliness. It is specifically designed to be an OLAP (Online Analytical Processing) database, making it well-suited for executing analytical queries on large datasets. This focus positions DuckDB as a highly efficient tool for data analytics. One of the standout features of DuckDB is its ability to run directly within data analysis environments. It integrates seamlessly with popular data science languages and tools, allowing data scientists to perform analysis within their familiar programming environment.
However, being a relatively new addition to the world of database management systems, DuckDB might not yet offer the same level of community support, breadth of tools, and integrations as more established databases (such as SQLite).
The duckdb package contains everything we need to set up the database and execute the same steps as with SQLite. If I write the data to the database and then read it back in, I get the same column types again - this is great!
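The R workflow is essentially the same as for SQLite (a sketch; file and table names are assumptions):

```r
library(duckdb)

# Connect to a local DuckDB database; the file is created if it does not exist
con_duckdb_r <- dbConnect(duckdb(), dbdir = "data_r.duckdb")

# Write the data and read it back into a different data frame
dbWriteTable(con_duckdb_r, "data", data_r, overwrite = TRUE)
data_r_duckdb <- dbGetQuery(con_duckdb_r, "SELECT * FROM data")

# Disconnect and shut down the embedded database
dbDisconnect(con_duckdb_r, shutdown = TRUE)
```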
If I read data from the database created in Python, we only see a small difference: while we have Date in R, we get a POSIXct column from the Python database. For many applications, this difference should have no impact. However, if you work with time series data, make sure to have the correct date format before you proceed (see the conversion sketch after the output below)!
```
# A tibble: 6 × 2
  data_r_duckdb   data_python_duckdb
  <chr>           <chr>
1 character       character
2 Date            POSIXct, POSIXt
3 POSIXct, POSIXt POSIXct, POSIXt
4 numeric         numeric
5 integer         integer
6 logical         integer
```
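As mentioned above, restoring the Date type after reading from the Python database takes a single conversion (a sketch; the column name follows the example data and is an assumption):

```r
# Convert the POSIXct column back to Date
data_python_duckdb$date_column <- as.Date(data_python_duckdb$date_column)
```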
I use the duckdb package to create the database and the sqlalchemy package to handle the connection. Note that I don't have to disconnect manually because the engine takes care of this automatically. Again, the great thing is that we get the same column types for the data that I put into the database and the data that I pull out of it!
```python
import duckdb
from sqlalchemy import create_engine

con_duckdb_python = create_engine("duckdb:///data_python.duckdb")
res = data_python.to_sql("data", con_duckdb_python, if_exists="replace", index=False)
data_python_duckdb = pd.read_sql_query("SELECT * FROM data", con_duckdb_python)
data_python_duckdb.info()
```
If I now read the data from the database generated in R, then we see that the only difference again arises in the date column, but at least the datetime columns are aligned.
DuckDB is a great alternative if you want to play around with new data storage technologies and if you work with multiple programming languages. The examples above show a large overlap between column types; the only major issue is that the Date type from R is interpreted as a string in Python.
Parquet
Parquet is a highly efficient columnar storage format, increasingly popular in the field of data analytics, particularly when dealing with large datasets. This format is structured in a way that stores data by columns rather than by rows, which is typical in traditional database formats. This columnar approach allows for more efficient data retrieval and scanning, making it an excellent choice for analytical querying where operations are often performed on specific columns of data (similar to DuckDB).
One of the key strengths of Parquet is its ability to compress data effectively. It supports various compression and encoding schemes, which can significantly reduce the storage footprint of large data sets without sacrificing read performance. This efficiency in data storage and compression makes Parquet an attractive option for systems where storage cost and I/O performance are critical considerations, such as in big data applications and cloud environments.
The arrow package provides all we need to write and read Parquet files. Similar to DuckDB, all column types are preserved exactly as they were before we wrote the data frame to a Parquet file.
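A sketch of the corresponding arrow calls (file names are assumptions):

```r
library(arrow)

# Write the example data to a Parquet file and read it back
write_parquet(data_r, "data_r.parquet")
data_r_parquet <- read_parquet("data_r.parquet")
```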
Feather
Feather is primarily designed for efficient data interchange between R and Python. Developed jointly by the R and Python teams, Feather provides a fast, lightweight, and easy-to-use binary file format for storing data frames. However, it is primarily intended as an intermediate format for data interchange rather than for long-term data storage. It does not offer the same level of compression as formats like Parquet, nor is it optimized for queries and data analysis tasks directly on the stored files.
The arrow package also provides all we need to write and read Feather files. Similar to DuckDB and Parquet, all column types are preserved exactly as they were before we wrote the data frame to a Feather file.
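The Feather workflow looks almost identical to Parquet (a sketch; file names are assumptions):

```r
library(arrow)

# Write the example data to a Feather file and read it back
write_feather(data_r, "data_r.feather")
data_r_feather <- read_feather("data_r.feather")
```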
If I now read the data from the file generated in R, then we see that the only difference again arises in the date column (just as with DuckDB and Parquet).
There are two main takeaways: first, R's handling of dates is peculiar and cannot be smoothly aligned with Python (at least not with pandas). I think I need to ask myself how often I actually need the Date type if I want to avoid issues with other languages. Second, Parquet is an exciting format for storing data and I'll definitely use it more in future applications.
Footnotes
1. There is also a dedicated StringDtype for string data, introduced in more recent versions of pandas (version 1.0.0 and later), but it seems that it is not used often. ↩︎
2. In R, dates are typically represented as the number of days since January 1, 1970, known as the Unix epoch. This is a standard date reference used in many programming languages. The integer is positive for dates after January 1, 1970, and negative for dates before it. For example, a Date object representing January 2, 1970, would have a value of 1, as it is one day after the epoch start. Similarly, POSIXct is a count of the number of seconds since the Unix epoch (January 1, 1970, 00:00:00 GMT). ↩︎