Half-Baked Data, Unfinished Blogs, and Dependency Resolution Hell
Sorry in advance, the timeline below isn’t working quite well on mobile. Will fix soon!
tl;dr
- Don’t let perfect be the enemy of good enough.
- Sometimes a simple solution is better than a fully correct solution.
- Share early, share often.
- I have my own data playground now: https://data.teonbrooks.com
How It Started
I initially began working on this forthcoming blog post three years ago and thought that it would be a simple afternoon project, but who hasn’t left a blog post unfinished for years with the promise to return to it? 😅
So this all started because I had this idea to create a simple data playground on my blog for bite-sized data science projects.
The setup was seemingly straightforward and my inaugural bite-sized project started off like most well-intended projects:
- properly scoped
- MVP
But this little project that could ended up falling victim to scope creep, perfectionism, grandiose plans, and multiple acts of abandonment.
Recently, I finally decided to just publish the damned thing and move on with my life. Before I did, though, I wanted to document the papercuts of being on the bleeding edge of Python on the web with Jupyterlite (powered by Pyodide), and the eternal love-hate relationship with packaging, versioning, and dependency resolution in Python and scientific computing.
All I wanted to do was write a simple blog post but it turned into a whodunit 😶
I will leave the details of the blog post for the blog post itself; I don’t want to steal its thunder. The gist is that I had an idea and I wanted to do some interactive data science for the core part of the piece.
Also, I have learned a lot about how the Pyodide build process and Python versioning work so you don’t have to. Sit back, relax, and come join me on an adventure through dependency hell 😈
Motivation
The idea I had was actually quite simple:
I’m a data scientist and often I have an idea for a project that I want to hack on for a little bit, to scratch that data science itch, but it’s only minorly interesting or impactful.
Data science is primarily done in one of two languages, Python or R (special shoutout to Julia, I would like to explore using you more), and I tend to favor the former. The workflow typically goes something like this: you create a Jupyter Notebook and maybe host that notebook on GitHub, but in doing so you often lose the interactivity or the ability to explore.
A key part of data science and reporting is that you answer questions with data, but what happens when you or the reader have another question? Do you have to set up new infrastructure to answer a follow-up question?
This has often been a frustrating point for data scientists. We share a report but if we want it to have a slider or reset with initial values or state, that would be a separate project endeavor in itself.
I wanted to have a place on the web for these simple explorations that I can easily share and also let folks explore further if they want to. I named this project: Half-baked Data 🎉
This idea was finally coming to fruition. A promising new project had just landed that satisfied all the requirements I needed: Jupyterlite, which is the Jupyterlab environment running fully in the browser with no server setup, all powered by Pyodide. I could host all my data science explorations using GitHub Pages with full interactivity.
My Data Science Notebook Requirements
- ✅ It has support for Python
- ✅ It runs in the browser
- ✅ It doesn’t require any setup
- ✅ Maintains interactivity
- ✅ I can embed it in my blog
However, at the moment, it does have a few shortcomings:
- ❌ Notebooks are currently monolingual
- ❌ Not all libraries are supported
- ❌ Uses Notebook paradigm (I prefer RMarkdown/Quarto/MyST environments for writing reports)
When the Rubber meets the Road
The Jupyterlite setup portion was pretty straightforward:
- I cloned the demo template and set up my repo.
- I deployed it to one of my subdomains, so now I have https://data.teonbrooks.com as my very own Jupyterlite instance, which is pretty cool 😎
Working with Photos in Python
One of the popular image libraries in Python is Pillow. It has broad support for some of the most common image formats. That seemed all well and good until I realized all the photos I had taken were saved with Apple’s new file format, HEIC.
Fucking Apple!
The HEIC format is a special case of HEIF that is used by Apple for their mobile devices (cf. ref). This format is not natively supported in Pillow. After I did some searching, I came across the following packages:
- `pillow-heif`: a library with bindings to `libheif`. It provides a plugin to Pillow.
- `pi-heif`: a bite-sized version of `pillow-heif` that only offers HEIF decoding. Built from the `pillow-heif` repo.
- `pyheif`: a different binding library to `libheif` that only supports reading HEIF and doesn’t have a plugin for `Pillow`.
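Here is a minimal sketch of what the plugin route can look like, assuming `pillow-heif` and Pillow are installed. The helper names (`find_heif_files`, `open_heif`) are made up for illustration; `register_heif_opener` is the plugin hook that `pillow-heif` provides.

```python
from pathlib import Path

HEIF_SUFFIXES = {".heic", ".heif"}


def find_heif_files(directory):
    """Return the HEIC/HEIF files in `directory`, sorted by path.

    The suffix match is case-insensitive, so IMG_0001.HEIC is included.
    """
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in HEIF_SUFFIXES
    )


def open_heif(path):
    """Open a HEIC/HEIF file as a regular Pillow image."""
    # Imported lazily so find_heif_files() works without these packages.
    from PIL import Image
    from pillow_heif import register_heif_opener

    register_heif_opener()  # registers the HEIF plugin with Image.open()
    return Image.open(path)
```

With that in place, something like `[open_heif(p) for p in find_heif_files("photos")]` would load a whole folder, where `photos` is a placeholder directory name.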
Putting it All Together
I started doing some of the data explorations in a notebook locally and things seemed to be working. This part really was quick. I was able to install the libraries, I had all the files local to my computer so I could easily iterate over them, and I managed to get locations added to a map (I’m really burying the lede on the project).
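For a sense of what "getting locations" can involve, here is a rough sketch that assumes the coordinates come from each photo's EXIF GPS block (the post itself doesn't spell out the method). The function names are hypothetical; `getexif()` and `get_ifd()` are Pillow calls, and `0x8825` is the standard EXIF pointer to the GPS directory.

```python
GPS_IFD_TAG = 0x8825  # EXIF pointer to the GPS sub-directory


def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to signed decimal degrees."""
    degrees, minutes, seconds = (float(x) for x in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if ref in ("S", "W") else value


def photo_location(image):
    """Return (latitude, longitude) from a Pillow image's EXIF, or None if absent."""
    gps = image.getexif().get_ifd(GPS_IFD_TAG)
    # GPS tag ids: 1 = latitude ref, 2 = latitude, 3 = longitude ref, 4 = longitude
    if not all(tag in gps for tag in (1, 2, 3, 4)):
        return None
    return dms_to_decimal(gps[2], gps[1]), dms_to_decimal(gps[4], gps[3])
```

The resulting `(lat, lon)` pairs are what a mapping library would need as input; photos without GPS data simply come back as `None`.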
Now I need to figure out how to recreate this magic on the web.
Accessing files
So file storage isn’t (readily) free 😅. You need to find somewhere to host your photos/dataset. This is something worth noting when you want to get a project off the ground and you want someone to be able to recreate your steps exactly.
One piece of advice I wish I had followed (I even told myself this early on) is to reduce your data payload. You can do this a few different ways:
- Compress your images
- Export the metadata if the metadata is really all you need
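Both options fit in a few lines. This is only a sketch: the function names, the 1600-pixel cap, and the JPEG quality of 80 are arbitrary choices for illustration, and the image helper assumes Pillow is installed.

```python
import json


def export_metadata(records, out_path):
    """Write a list of per-photo metadata dicts to a compact JSON file."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, separators=(",", ":"))
    return out_path


def shrink_image(src, dst, max_side=1600, quality=80):
    """Save a downscaled JPEG copy of `src` at `dst` (requires Pillow)."""
    from PIL import Image  # lazy import: only needed for the image path

    with Image.open(src) as img:
        img.thumbnail((max_side, max_side))  # resizes in place, keeps aspect ratio
        img.convert("RGB").save(dst, "JPEG", quality=quality)
```

If the analysis only needs timestamps and coordinates, the JSON file alone can be a tiny fraction of the size of the photo folder.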
I ended up storing all the files with `git-lfs`. At first I stored a zip of the directory, but then I had to figure out how to read a zip into memory in Jupyterlite and I really went down the rabbit hole on that one. I later unzipped the archive and tracked the individual images with `git-lfs`.
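For the record, reading a zip entirely in memory needs nothing beyond the standard library. The sketch below is one way to do it, not necessarily the route taken at the time; `read_zip_bytes` is a made-up name, and the fetch shown in the trailing comment uses a placeholder URL.

```python
import io
import zipfile


def read_zip_bytes(data):
    """Map each file name in a zip archive (given as bytes) to its contents."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        }


# Inside a Pyodide kernel, the bytes could come from the browser's fetch:
#     from pyodide.http import pyfetch
#     response = await pyfetch("photos.zip")  # placeholder URL
#     files = read_zip_bytes(await response.bytes())
```

Each value in the returned dict can then be wrapped in another `io.BytesIO` and handed to an image library without ever touching the virtual filesystem.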
Note: Remember to fetch your LFS files in your GitHub Actions workflow when you’re building your site; otherwise they won’t be there.
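One way to catch this early: when the files have not been fetched, what sits in the checkout is a small text pointer that starts with a Git LFS spec line rather than image bytes. A quick check along these lines (helper names are hypothetical) could run as a build step:

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(data):
    """True if `data` looks like a Git LFS pointer file instead of real content."""
    return data.startswith(LFS_POINTER_PREFIX)


def unfetched_files(paths):
    """Return the paths whose contents are still LFS pointers."""
    missing = []
    for path in paths:
        with open(path, "rb") as fh:
            # Only the first few bytes are needed to recognise a pointer.
            if is_lfs_pointer(fh.read(len(LFS_POINTER_PREFIX))):
                missing.append(path)
    return missing
```

Failing the build when `unfetched_files(...)` is non-empty is a lot friendlier than discovering broken images after deployment.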
Getting the Package Right
Packaging in Pyodide, the Python distribution that powers Jupyterlite, works differently than traditional packaging in Python with tools such as `pip` and `conda`. It might be worth breaking this out into a separate post, but here’s the crux of it:
The core Pyodide functionality, which is the CPython implementation and its standard library, has to be compiled to WebAssembly. This is done via Emscripten, along with all the patches necessary to make it work.
The “Big Four” of core scientific computing (`numpy`, `scipy`, `matplotlib`, and `pandas`) also have to be compiled to WebAssembly via Emscripten, with their own patches as well.
Back when all of this was coming together, there weren’t elegant ways to get other packages included in the distribution other than building them along with everything else. This means that there is a whole collection of packages that need to be built whenever Pyodide is built, which leads to very long CI build times (~2 hrs).
Fortunately, there are some new and nifty solutions to help solve the build problem. If a package isn’t currently built with Pyodide but is pure Python, it can now be installed via `micropip`. This is super neat!
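A rough sketch of how that plays out in practice. Pure-Python wheels are recognisable by their file name, which ends in the `none` ABI tag and the `any` platform tag; the two tag helpers below are hypothetical names, while `micropip.install` is the real entry point and only exists inside a Pyodide kernel.

```python
def wheel_tags(filename):
    """Split a wheel file name into its (python, abi, platform) tags."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def is_pure_python_wheel(filename):
    """Pure-Python wheels carry the `none` ABI tag and the `any` platform tag."""
    _, abi_tag, platform_tag = wheel_tags(filename)
    return abi_tag == "none" and platform_tag == "any"


async def install_pure_python(*requirements):
    """Install packages at runtime; only works inside a Pyodide kernel."""
    import micropip  # ships with Pyodide, not with regular CPython

    await micropip.install(list(requirements))
```

In a Jupyterlite notebook cell this would be called as `await install_pure_python("some-package")`, with `some-package` standing in for whatever pure-Python dependency is missing.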
The team has also worked to mirror the build process for the collection of packages currently built with Pyodide in a separate repo, pyodide-recipes, and they plan to unvendor those packages from the distribution with the 0.28 release (issue, pr).
Since libraries that depend on C need to be built with Pyodide, their updates and releases are contingent on the distribution’s release. I learned firsthand that not all versions play nicely with each other, and that’s usually because of dependency resolution.
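Because package versions are tied to the distribution, it helps to look up what a given Pyodide release actually bundles before upgrading. Recent releases publish a `pyodide-lock.json` file; the sketch below assumes it has a top-level `packages` mapping whose entries carry a `version` field, so treat that layout (and the function name) as an assumption to verify against the release you use.

```python
import json


def bundled_versions(lock_text, names):
    """Look up package versions in a Pyodide lock file given as JSON text.

    Assumes a {"packages": {key: {"version": ...}}} layout.
    """
    packages = json.loads(lock_text).get("packages", {})
    # Normalise case and hyphen/underscore differences before matching.
    index = {key.lower().replace("_", "-"): entry for key, entry in packages.items()}
    return {
        name: index.get(name.lower().replace("_", "-"), {}).get("version")
        for name in names
    }
```

Calling `bundled_versions(text, ["Pillow", "pillow-heif"])` on the lock files of two releases shows at a glance whether one package moved while its companion stayed put.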
tl;dr:
- version-control your work
- hard-code dependencies in a `requirements.txt`
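Pins only help if the environment actually matches them. As a small sketch (function names are hypothetical, and only plain `name==version` lines are handled), the standard library's `importlib.metadata` can compare pins against what is installed:

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse `name==version` lines, ignoring blanks and comments."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of `name`, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def pin_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every pin that is not satisfied."""
    return {
        name: (wanted, lookup(name))
        for name, wanted in pins.items()
        if lookup(name) != wanted
    }
```

Running `pin_mismatches(parse_pins(open("requirements.txt").read()))` at the top of a notebook makes a silent version bump show up as a dict entry instead of a cryptic import error.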
Actually, in the past few weeks, I started to use `uv` for package management for Python projects, and that’s been a breeze. Use it, it’s legit! You can define a `pyproject.toml`
and it generates
Dependency Hell
With Pyodide (0.22), everything worked! [ship it 🚀] tbh, I probably should have written and released the blog post then. I naively bumped my Jupyterlite version and then everything came tumbling down.
It took me some time to figure out what went wrong. I spent some time mapping out the time, place, and murder weapon, and decided a timeline would best highlight the events.
- Jun 2022
  - Back in June 2022, I made a request to support `pyheif` for Pillow.
- Oct 2022: Pillow_heif Request
  - It isn't a simple Python wheel; it has some C++ bindings, so it requires a full build to WASM
  - requires libheif
  - requires libde265
  - pillow_heif (0.6.1)
  - pyheif (0.7.0)
  - There is a difference between `pillow_heif` and `pyheif`. I ended up needing `pillow_heif`.
- Nov 2022: Pyodide (0.22) release
  - Pyodide (0.22.0a2) release Nov 2022
  - Pyodide (0.22.0) release Jan 2023
  - Pillow (9.1.1)
  - pillow_heif (0.8.0)
  - Things are compatible ✅
  - I should have written and shared the blog piece here, two years ago 🤣
  - Cautionary tale about versioning
  - Unbeknownst to me, Pyodide (0.25) would break all things necessary for a straightforward blog post.
- Sep 2023: Pyodide (0.24) release
  - Pyodide (0.24) upgraded Pillow to (10.0.0)
  - Just checked, still stable with pillow_heif (0.8.0) ✅
- Jan 2024: Pyodide (0.25) release
  - Pillow upgraded to (10.2.0) (commit)
  - Changelog doesn't attribute package change
  - It's in a commit with a lot of packages being upgraded (minus pillow-heif), so I guess the changelog update was just skipped for that massive overhaul
  - should probably still capture it though
  - Pillow (10.2.0) breaks compatibility with pillow_heif (0.8.0)
  - True culprit is Pillow=10.1.0 (release date: 2023-10-15)
  - pillow_heif (0.13.0) fixes compatibility with Pillow (release date: 2023-08-09)
- Sep 2024: Updated to latest Jupyterlite demo release
  - Jupyterlite relies on `pyodide-kernel` to power the Jupyterlab experience
  - As of `pyodide-kernel` (0.4.0), it relies on Pyodide (0.26.0), the problematic build
  - Jupyterlite demo had been bumped to `jupyterlite-pyodide-kernel==0.4.2`
  - `pyodide-kernel` (0.4.2) relies on Pyodide (0.26.2)
    - Py3.12
    - pillow (10.2.0)
    - pillow_heif (0.8.0)
- Sep 25, 2024: Pyodide Pull Request to bump `pillow_heif`
- Dec 2024: This PR superseded the previous one to bump `pillow_heif`
- Jan 2025
  - Pyodide (0.27.0) release - Jan 1, 2025
    - pillow_heif (0.8.0) with mislabeled package name --> pillow-heif (0.20.0)
    - pillow (10.2.0)
  - Jupyterlite (0.6.0) release - Jan 2, 2025 (Support Pyodide 0.27 by jtpio, Pull Request #156, jupyterlite/pyodide-kernel)
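The breakage in the timeline boils down to one rule: Pillow 10.1.0 or newer needs pillow_heif 0.13.0 or newer. Here is a small sketch that encodes just that observation. It is not an official compatibility matrix, the function names are made up, and the version parsing is deliberately naive.

```python
def version_tuple(version):
    """Turn '10.2.0' into (10, 2, 0); suffixes such as 'a2' are dropped."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad '10.1' to (10, 1, 0) so comparisons behave
        parts.append(0)
    return tuple(parts)


def heif_combo_ok(pillow, pillow_heif):
    """Apply the rule from the timeline: Pillow >= 10.1.0 needs pillow_heif >= 0.13.0."""
    if version_tuple(pillow) >= (10, 1, 0):
        return version_tuple(pillow_heif) >= (0, 13, 0)
    return True
```

Fed the pairs from the timeline, `heif_combo_ok("10.0.0", "0.8.0")` passes while `heif_combo_ok("10.2.0", "0.8.0")` fails, which matches where things fell over.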
Lessons Learned
Here are some things I could have done to speed the process along:
- Converted all the images from HEIC to JPG
- Done everything locally and embedded the images
- Written the piece three years ago when things worked
- Exported the metadata into a JSON file and just used that