Interactive Data Visualization with Python

Python

Visualization

A comparison of the dynamic visualization libraries plotly, bokeh and altair for the programming language Python

Author

Christoph Scheuch

Published

January 28, 2024

Interactive figures are an essential tool for communicating data insights, in particular in reports or dashboards. In this blog post, I compare different libraries for dynamic data visualization in Python. Before we dive into the comparison, here is a quick introduction to each contestant.

plotly is an interactive, open-source plotting library that enable the creation of publication-quality figures. It supports a wide range of chart types including line charts, scatter plots, bar charts, pie charts, bubble charts, heatmaps, and more advanced visualizations like 3D plots and geographical maps. One of the key features of plotly is its ability to produce interactive plots that users can zoom, pan, and hover over, providing tooltips and additional information, which makes it highly effective for data exploration and presentation.

bokeh is a powerful, flexible library for creating interactive plots and dashboards in the web browser. It is designed to help users create elegant, concise constructions of versatile graphics with high-performance interactivity over very large or streaming datasets. One of the core features of bokeh is its ability to generate dynamic javascript plots directly from Python code, which means you can harness the interactivity of web technologies without needing to write any javascript yourself. The plots can be embedded in HTML pages or served as standalone applications, making it a versatile choice for web development and data analysis tasks.

altair is a declarative statistical visualization library, designed to create interactive visualizations with a minimal amount of code. It is built on top of the powerful Vega and Vega-Lite visualization grammars, enabling the construction of a wide range of statistical plots with a simple and intuitive syntax. One of the key advantages of altair is its emphasis on data-driven visualization design. By allowing users to think about their data first and foremost, Altair facilitates the exploration and altairunderstanding of complex datasets.

I compare code to generate plotly, bokeh, and altair output in the post below. The types of plots that I chose for the comparison heavily draw on the examples given in R for Data Science - an amazing resource if you want to get started with data visualization. Spoiler alert: I’m not always able to replicate the same figure with all approaches (yet).

Loading libraries and data

We start by loading a dew data manipulation libraries, the main libraries and modules of interest, and palmerpenguins data. We then use the penguins data frame as the data to compare all functions and methods below. Note that I drop all rows with missing values because I don’t want to get into related messages in this post.

# Load data manipulation libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy.stats import gaussian_kde

# Load plotly visualization modules
import plotly.express as px
import plotly.graph_objs as go

# Load bokeh visualization modules
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, CategoricalColorMapper, HoverTool, FactorRange
from bokeh.layouts import gridplot
from bokeh.transform import factor_cmap
from bokeh.palettes import Spectral6, Category10
output_notebook()

# Load altair visualization library
import altair as alt

# Load penguins data
from palmerpenguins import load_penguins
penguins = load_penguins().dropna()

Loading BokehJS ...

A full-blown example

Let’s start with an advanced example that combines many different aesthetics at the same time: we plot two columns against each other, use color and shape aesthetics do differentiate species, include separate regression lines for each species, manually set nice labels, and use a theme. You can click through the results in the tabs below.

Note that we have to manually add regression lines for plotly and bokeh, while altair has built-in support for them.

fig_full = (px.scatter(
    penguins, x = "bill_length_mm", y = "flipper_length_mm", 
    color = "species", symbol = "species",
    title = "Bill length vs. flipper length",
    labels = {"bill_length_mm": "Bill length (mm)", 
              "flipper_length_mm": "Flipper length (mm)",
              "species": "Species"})
  .update_traces(marker = dict(size = 10))
  .update_layout(
    plot_bgcolor = "white",
    xaxis = dict(zeroline = False, ticklen = 5),
    yaxis = dict(zeroline = False, ticklen = 5))
)

for species in penguins["species"].unique():
  penguins_subset = penguins[penguins["species"] == species]
  X = penguins_subset["bill_length_mm"]
  X = sm.add_constant(X)
  y = penguins_subset["flipper_length_mm"]

  model = sm.OLS(y, X).fit()
  line = model.params[0] + model.params[1] * penguins_subset["bill_length_mm"]

  fig_full.add_trace(
    go.Scatter(x = penguins_subset["bill_length_mm"], y = line, 
               mode = "lines", showlegend = False)
  )

fig_full

fig_full = figure(
  title = "Bill length vs. flipper length",
  x_axis_label = "Bill length (mm)", y_axis_label = "Flipper length (mm)",
  tools = "pan,wheel_zoom,box_zoom,reset,hover",
  tooltips = [
    ("Bill length (mm)", "@bill_length_mm"),
    ("Flipper length (mm)", "@flipper_length_mm"),
    ("Species", "@species")
  ]
)

species = penguins["species"].unique()
color_map = {
  species: color for species, color in zip(species, ["red", "green", "blue"])
}  

for species in species:
    penguins_subset = penguins[penguins["species"] == species]

    fig_full.scatter(
      source = ColumnDataSource(penguins_subset),
      x = "bill_length_mm", y = "flipper_length_mm", 
      legend_label = species, color = color_map[species], 
      size = 10, fill_alpha = 0.6
    )
              
    X = penguins_subset["bill_length_mm"]
    X = sm.add_constant(X)
    y = penguins_subset["flipper_length_mm"]

    model = sm.OLS(y, X).fit()
    predictions = model.predict(X)

    fig_full.line(
      penguins_subset["bill_length_mm"], predictions, 
      color =color_map[species], line_width = 2, legend_label = species
    )

fig_full.legend.title = "Species"
fig_full.legend.location = "top_left"
fig_full.background_fill_color = "white"
fig_full.border_fill_color = "white"
fig_full.outline_line_color = None
 
show(fig_full)

points = (alt.Chart(penguins)
  .mark_point(size=100, filled=True)
  .encode(
    x = alt.X("bill_length_mm", 
              scale = alt.Scale(zero = False), 
              title = "Bill length (mm)", 
              axis = alt.Axis(tickCount = 5, grid = False)),
    y = alt.Y("flipper_length_mm", 
              scale = alt.Scale(zero = False), 
              title = "Flipper length (mm)", 
              axis = alt.Axis(tickCount = 5, grid = False)),
    color = "species:N", shape = "species:N",
    tooltip = [alt.Tooltip("bill_length_mm", title = "Bill length (mm)"),
               alt.Tooltip("flipper_length_mm", title = "Flipper length (mm)"),
               alt.Tooltip("species", title = "Species")])
)

regression_lines = (alt.Chart(penguins)
  .transform_regression(
    "bill_length_mm", "flipper_length_mm", groupby = ["species"]
  )
  .mark_line()
  .encode(
    x = "bill_length_mm:Q", y = "flipper_length_mm:Q", 
    color = "species:N"
  )
)

fig_full = ((points + regression_lines)
  .properties(title = "Bill length vs. flipper length")
  .configure_view(stroke = "transparent", fill = "white")
  .configure_axis(labelFontSize = 10, titleFontSize = 12)
)

fig_full

Visualizing distributions

A categorical variable

Let’s break down the differences in smaller steps by focusing on simpler examples. If you have a categorical variable and want to compare its relevance in your data, then bar charts are your friends. The code chunks below show you how to implement them for each approach.

island_counts = (penguins
  .groupby("island")
  .size()
  .reset_index(name = "n")
)

(px.bar(island_counts, x = "island", y = "n")
  .update_layout(barmode = "stack")
)

island_counts = (penguins
  .groupby("island")
  .size()
  .reset_index(name = "n")
)

islands = island_counts["island"].unique()

fig_bar = figure(x_range = islands,
                 tools = "pan,wheel_zoom,box_zoom,reset,hover")

fig_bar.vbar(source = ColumnDataSource(island_counts),
             x = "island", top = "n", width = 0.9, line_color = "white")
show(fig_bar)

island_counts = (penguins
  .groupby("island")
  .size()
  .reset_index(name = "n")
)

(alt.Chart(island_counts)
  .mark_bar()
  .encode(x = "island", y = "n",
          tooltip = ["island", "n"])
)

A numerical variable

If you have a numerical variable, usually histograms are a good starting point to get a better feeling for the distribution of your data. You can quickly create histograms in plotly and altair, while you have to manually construct the the histogram from individual bars in bokeh.

(px.histogram(penguins, x = "bill_length_mm")
  .update_traces(xbins = dict(size =  2))
)

bin_size = 2
bins = np.arange(
  start = penguins["bill_length_mm"].min(), 
  stop = penguins["bill_length_mm"].max() + bin_size, 
  step = bin_size
)
hist, edges = np.histogram(penguins["bill_length_mm"], bins = bins)

source = ColumnDataSource(
  data = dict(top = hist, left = edges[:-1], right = edges[1:])
)

fig_histogram = figure(tools = "pan,wheel_zoom,box_zoom,reset,hover")

fig_histogram.quad(
  source = source, 
  bottom = 0, top = "top", left = "left", right = "right",
  fill_color = "skyblue", line_color = "white"
)

show(fig_histogram)

(alt.Chart(penguins)
  .mark_bar()
  .encode(
    x = alt.X("bill_length_mm:Q", bin = alt.Bin(step = 2)),
    y = alt.Y("count()"),
    tooltip = [alt.Tooltip("bill_length_mm:Q", bin = alt.Bin(step = 2)),
               alt.Tooltip("count()")]
  )
)

Visualizing relationships

A numerical and a categorical variable

To visualize relationships, you need to have at least two columns. If you have a numerical and a categorical variable, then histograms or densities with groups are a good starting point. The next example illustrates the use of densities. plotly and altair have built-in support for densities, while you have to manually compute the densities in bokeh.

(px.histogram(penguins, x = "body_mass_g", color = "species",
              histnorm = "density", barmode = "overlay", opacity = 0.5)
  .update_traces(marker_line_width = 0.75)
)

/Users/krise/Documents/GitHub/tidy-intelligence/blog/renv/python/virtualenvs/renv-python-3.10/lib/python3.10/site-packages/plotly/express/_core.py:2065: FutureWarning:

When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.

species = penguins["species"].unique()

colors = Category10[len(species)]

fig_densities = figure(tools = "pan,wheel_zoom,box_zoom,reset,hover")

for j, species in enumerate(species):
    penguins_subset = penguins[penguins["species"] == species]
    body_mass = penguins_subset["body_mass_g"].dropna()
    
    kde = gaussian_kde(body_mass)
    x_range = np.linspace(body_mass.min(), body_mass.max(), 100)
    density = kde(x_range)

    source = ColumnDataSource(
      data = dict(body_mass_g = x_range, density = density, species = [species]*len(x_range))
    )
    
    fig_densities.patch(
      source = source, 
      x = "body_mass_g", y = "density", 
      alpha = 0.5, color = colors[j], legend_label = species
    )

show(fig_densities)

(alt.Chart(penguins)
  .transform_density("body_mass_g", 
                     as_ = ["body_mass_g", "density"], groupby = ["species"])
  .mark_area(opacity = 0.5)
  .encode(
    x = alt.X("body_mass_g:Q"), y = alt.Y("density:Q"),
    color = "species:N", tooltip = ["species:N", "body_mass_g:Q"]
  )
)

Two categorical columns

Stacked bar plots are a good way to display the relationship between two categorical columns. For plotly and altair, we simply compute the percentages by species and island and put them into the bar plotting functions. Note that bokeh is peculiar because it requires the data in wide format for stacked bar charts.

species_island_counts = (penguins
  .groupby(["species", "island"])
  .size()
  .reset_index(name = "n")
  .assign(
    percentage = lambda x: x["n"] / x.groupby("species")["n"].transform("sum")
  )
)

px.bar(species_island_counts, x = "species", y = "percentage", 
       color = "island", barmode = "stack")

/Users/krise/Documents/GitHub/tidy-intelligence/blog/renv/python/virtualenvs/renv-python-3.10/lib/python3.10/site-packages/plotly/express/_core.py:2065: FutureWarning:

When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.

species_island_counts = (penguins
  .groupby(["species", "island"])
  .size()
  .reset_index(name = "n")
  .assign(
    percentage = lambda x: x["n"] / x.groupby("species")["n"].transform("sum")
  )
)

species_island_counts_wide = (species_island_counts
  .pivot(index = "species", columns = "island", values = "percentage")
  .fillna(0)
)

fig_stacked = figure(x_range = penguins["species"].unique())

fig_stacked .vbar_stack(
  source = ColumnDataSource(data = species_island_counts_wide), 
  stackers = penguins["island"].unique(), x = "species", 
  width = 0.9, color = ["red", "blue", "green"]
)

show(fig_stacked)

species_island_counts = (penguins
  .groupby(["species", "island"])
  .size()
  .reset_index(name = "n")
  .assign(
    percentage = lambda x: x["n"] / x.groupby("species")["n"].transform("sum")
  )
)

(alt.Chart(species_island_counts)
  .mark_bar()
  .encode(
    x = "species", y = "percentage", color = "island",
    order = alt.Order("island", sort = "ascending"),
    tooltip = ["species", "island", "percentage"]
  )
)

Two numerical columns

Scatter plots and regression lines are definitely the most common approach for visualizing the relationship between two numerical columns and we focus on scatter plots for this example (see the first visualization example if you want to see again how to add a regression line). Note that altair axis ranges by default includes 0,so you need to manually tell the scale to ignore it.

(px.scatter(penguins, x = "bill_length_mm", y = "flipper_length_mm")
  .update_traces(marker = dict(size = 10))
)

fig_scatter = figure(tools = "pan,wheel_zoom,box_zoom,reset,hover")

fig_scatter.circle(
  source = ColumnDataSource(penguins),
  x = "bill_length_mm", y = "flipper_length_mm", 
  size = 10
)

show(fig_scatter)

(alt.Chart(penguins)
  .mark_circle(size = 100)
  .encode(
    x = alt.X("bill_length_mm", scale = alt.Scale(zero = False)),
    y = alt.Y("flipper_length_mm", scale = alt.Scale(zero = False)),
    tooltip = ["bill_length_mm", "flipper_length_mm"]
  )
)

Three or more columns

You can include more information by mapping columns to additional aesthetics. For instance, we can map colors and shapes to species and create separate plots for each island by using facets. Facets are actually a great way to extend your figures, so I highly recommend playing around with them using your own data.

Facets in bokeh involve a more manual process because it doesn’t have a direct equivalent of plotly’s facet_col parameter or altair’s facet() method. Instead, you’ll create individual plots for each facet and arrange them in a grid, which also means that you cannot have an automatically shared legend.

px.scatter(
  penguins, 
  x = "bill_length_mm", y = "flipper_length_mm", 
  color = "species", facet_col = "island"
)

islands = penguins["island"].unique()
species = penguins["species"].unique()

color_mapper = CategoricalColorMapper(
  factors = species, palette = ["red", "green", "blue"]
)

plots = []
for island in islands:
  penguins_subset = penguins[penguins["island"] == island]
    
  p = figure(tools="pan,wheel_zoom,box_zoom,reset", 
             width = 250, height = 250)
    
  p.circle(x = "bill_length_mm", y = "flipper_length_mm", 
           source = ColumnDataSource(penguins_subset),
           color = {"field": "species", "transform": color_mapper},
           legend_field = "species", size = 8)
    
  plots.append(p)

fig_grid = gridplot(plots, ncols = 3)

show(fig_grid)

(alt.Chart(penguins)
  .mark_circle()
  .encode(
    x = alt.X("bill_length_mm", scale = alt.Scale(zero = False)),
    y = alt.Y("flipper_length_mm", scale = alt.Scale(zero = False)),
    tooltip = ["bill_length_mm", "flipper_length_mm"],
    color = "species:N"
  )
  .facet(column = "island:N")
)

Time series

As a last example, we quickly dive into time series plots where you typically show multiple lines over some date vector. Here, I aggregate the number of penguins by year and island and plot the corresponding lines. While you can simply define colors and line types in plotly and altair plotting functions, you have to manually loop in bokeh.

year_island_count = (penguins
  .groupby(["year", "island"])
  .size()
  .reset_index(name = "n")
)

px.line(year_island_count, 
        x = "year", y = "n", 
        color = "island", line_shape = "linear", line_dash = "island")

islands = year_island_count["island"].unique()
colors = ["blue", "green", "red"]
dashes = ["solid", "dashed", "dotdash"] 

fig_time_series = figure(tools = "pan,wheel_zoom,box_zoom,reset,hover")

for j, island in enumerate(islands):
  year_island_count_subset = year_island_count[
    year_island_count["island"] == island
  ]
  
  fig_time_series.line(
    source = ColumnDataSource(year_island_count_subset),
    x = "year", y = "n", 
    legend_label = island, 
    color = colors[j % len(colors)], 
    line_dash = dashes[j % len(dashes)], line_width = 2
  ) 

show(fig_time_series)

(alt.Chart(year_island_count)
  .mark_line()
  .encode(
    x = "year:T", y = "n:Q",
    color = "island:N", strokeDash = "island:N",
    tooltip = ["year", "n", "island"]
  )
)

Conclusion

plotly, bokeh, and altair each cater to distinct visualization needs in Python. plotly shines with its interactive, high-quality visuals and ease of embedding in web applications, making it ideal for creating complex interactive charts and dashboards. bokeh is focused on real-time data visualizations and interactivity, particularly suited for web apps that require dynamic data streaming. Its strength lies in the seamless integration of Python code with web technologies. altair offers a declarative approach, emphasizing simplicity and efficiency in creating elegant statistical visualizations with minimal code, making it ideal for exploratory data analysis in notebooks.