Revamping my blog... again

Background

Well, I've succumbed to the ever-present urge to completely change one's blog setup. It all started when I wanted to add my blog to the Awesome PyLadies' blogs repo. As part of the configuration you can add your blog's RSS feed (structured information about a blog's contents). But the configuration says:

if you wish to have your blog posts being promoted by the Mastodon bot; the RSS feed should be for Python-related posts

My previous blog generator was zola, which worked really well and was easy to set up! However, zola does not support per-tag (or "taxonomy" as zola calls them) feeds. I considered contributing support for this to zola, but I figured I'd look around at other static site generators and see what they support. My blog content is just a bunch of Markdown files after all, so it should be easy to move to another static site generator!

Yak shaving, for fun and profit

I came across Pelican, which was really appealing for a few reasons. First, it supports per-tag RSS feeds. But also, it is written in Python, which felt fitting since I am a Pythonista. So I decided I would try to port my blog to Pelican. As you may be able to tell by looking at the footer, I did so successfully :)

Setting up Pelican is actually super easy. I installed Pelican with Markdown support by running `uv tool install pelican[markdown]` and ran `pelican-quickstart` to set up a project. After answering a few prompts, I had a full project set up and could copy over the Markdown files used to write this blog. After changing the metadata from zola's format to Pelican's, I had a blog generated... with no theme.
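For example, the per-tag feeds that kicked this whole thing off boil down to a couple of settings in pelicanconf.py. The setting names below are Pelican's; the output paths are just an illustrative choice:

# pelicanconf.py (excerpt): emit a feed per tag, e.g. feeds/python.rss.xml
TAG_FEED_RSS = 'feeds/{slug}.rss.xml'
TAG_FEED_ATOM = 'feeds/{slug}.atom.xml'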

Oh... I needed to see what themes were available. Fortunately, Pelican makes this easy: the pelicanthemes.com website collects a number of community-authored themes. Unfortunately, I didn't see any themes I loved.

Introducing pelican-theme-terminimal

So, I did the only natural thing to do and ported the zola theme I was using to Pelican. Fortunately, this wasn't actually too bad. Zola uses Tera for its templates, which is based on Jinja2, which is what Pelican uses. So for the most part I only had to update the variable names to get the theme ported over. The layout the two generators expect is slightly different, so I had to restructure a few things, but overall it was pretty easy and enjoyable.

You can check out the theme's code here. I don't plan on working on the theme much beyond adding the features and customizations I want, but it is open source if anyone else wants to use it or submit patches.

The top priorities I have to work on are:

  • Links to RSS feeds
  • Mastodon verification

So yeah, my blog is now running on Pelican and Python 🎉

I have a few ideas to blog about over the next week or two so check back soon, or subscribe to my RSS feed.

Using multiprocessing and sqlite3 together

Note from the author: this is a pseudo-TIL, but I hadn't seen it written down anywhere, so hopefully someone finds it useful! Jump to "Solution" below if you don't care about the background.

Background

Generating Data

I recently started working on a reinforcement learning project, and I needed to generate a lot of training data. The project involves quantum compilers, so the data I generate is quantum circuits. For those unfamiliar, quantum circuits are just sequences of unitary matrices laid out in a particular order. I chose to store the circuit as a sequence of unitary gate names. The output of the data generation is the unitaries (numpy arrays) that are the result of multiplying the matrices in the circuit together.
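As a rough sketch of the idea (the gate set and names here are purely illustrative, not the project's actual gates):

import functools

import numpy as np

# A couple of illustrative single-qubit gates (2x2 unitaries).
GATES = {
    'H': np.array([[1, 1], [1, -1]]) / np.sqrt(2),
    'T': np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]]),
}

def circuit_unitary(gate_names):
    # Later gates multiply on the left, so reverse the circuit order.
    matrices = [GATES[name] for name in reversed(gate_names)]
    return functools.reduce(np.matmul, matrices)

unitary = circuit_unitary(['H', 'T', 'H'])  # a 2x2 complex numpy array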

I ended up wanting to generate somewhere in the region of a few hundred billion matrices, each of them very small. I knew off the bat that this would require a fair bit of time, and I wanted to take advantage of the 32-core server I own. Since I was using Python to generate this data, I used the multiprocessing module. Sadly I cannot yet take advantage of the Python multithreading improvements coming in 3.12.
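The generation loop itself is the usual multiprocessing pattern, roughly like this (the function below is a stand-in for the real circuit generation):

import multiprocessing

import numpy as np

def generate_unitary(seed):
    # Stand-in for the real work: sample a random circuit and multiply its
    # gates into one small matrix (here just a random 2x2 matrix).
    return seed, np.random.default_rng(seed).standard_normal((2, 2))

if __name__ == '__main__':
    with multiprocessing.Pool(processes=32) as pool:
        for seed, unitary in pool.imap_unordered(generate_unitary, range(1_000)):
            pass  # saving each result is where the trouble started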

Disk Space Woes

For saving the generated matrices, I started off by doing the simplest thing: just using plain old np.savetxt to save the (pretty tiny) matrices to disk in each process after computing the product of the matrices in the quantum circuit. This... was problematic. I quickly ran into an out-of-disk-space error. Normally this means I need to clear out space on whatever VM I am using, but there was one problem -- I still had hundreds of gigabytes of space left on disk!
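The naive saving step looked roughly like this (the directory layout and file naming are just how I happened to organize things):

import numpy as np

def save_unitary(circuit_id, unitary):
    # One tiny text file per generated unitary: fine for thousands of files,
    # not for hundreds of billions of them.
    np.savetxt(f'unitaries/{circuit_id}.txt', unitary)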

I quickly deduced that the error was actually caused by hitting the limit on the number of entries in a directory (dang it, CIFS!). I briefly tried to come up with my own scheme to split the unitaries across more directories to avoid this limit, but I ended up hitting more file system limits. It was clear that just writing files to disk wouldn't scale to the size of the dataset I needed to generate.

Choosing a Database

Of course, when dealing with so many small pieces of data, a database was the right solution to this problem. Why didn't I start with a database to begin with? Partially because I wanted to make it easy to load individual unitaries (np.loadtxt is an incredibly handy API), and partially because I was just hacking this data generation script together.

I had one issue with switching to a database: I wanted something simple and lightweight; I didn't need anything fancy like Postgres. SQLite is the obvious choice, but out of the box SQLite does not handle concurrent writes well, which is exactly what I wanted to do!

Solution

So how can one achieve concurrent writes in Python using sqlite3? SQLite by default uses a rollback journal to maintain consistency. You can change the configuration so that SQLite uses write-ahead logging (WAL) instead, which lets readers and the writer proceed concurrently and makes writes from multiple processes much more workable. You can enable WAL mode in Python by setting the journal mode:

# assume some Cursor object `cursor`
cursor.execute('PRAGMA journal_mode = WAL')

However, I started getting exceptions partway through: some of the processes calculating unitaries were being told the database was locked, even though (to my mind) it should not have been -- those processes should just be appending to the WAL. Therefore I also set the SQLite pragma synchronous to OFF, which means SQLite no longer waits for data to be flushed to disk before continuing. Note this is dangerous: your database could become corrupted if the operating system crashes or the server loses power. That is acceptable to me because I can always regenerate the data, and either of those is very unlikely to happen while I run these data generation tasks. This can be done in Python like so:

# assume some Cursor object `cursor`
cursor.execute('PRAGMA synchronous = OFF')
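Putting the pieces together, here is a minimal sketch of the pattern I ended up with: each worker process opens its own connection, sets the pragmas, and writes its results (the table schema and the choice to store matrices as raw bytes are just illustrative):

import multiprocessing
import sqlite3

import numpy as np

DB_PATH = 'unitaries.db'

def open_connection():
    # Every process needs its own connection. Autocommit keeps each INSERT in
    # its own short write transaction; journal_mode = WAL is persistent in the
    # database file, but synchronous must be set on each new connection.
    connection = sqlite3.connect(DB_PATH, timeout=60, isolation_level=None)
    connection.execute('PRAGMA journal_mode = WAL')
    connection.execute('PRAGMA synchronous = OFF')
    return connection

def worker(seeds):
    connection = open_connection()
    for seed in seeds:
        # Stand-in for the real circuit generation: a random 2x2 matrix.
        unitary = np.random.default_rng(seed).standard_normal((2, 2))
        connection.execute(
            'INSERT INTO unitaries (circuit_id, matrix) VALUES (?, ?)',
            (seed, unitary.tobytes()),
        )
    connection.close()

if __name__ == '__main__':
    connection = open_connection()
    connection.execute(
        'CREATE TABLE IF NOT EXISTS unitaries (circuit_id INTEGER PRIMARY KEY, matrix BLOB)'
    )
    connection.close()

    # One range of circuit ids per task, spread across the worker processes.
    chunks = [range(i, i + 1_000) for i in range(0, 32_000, 1_000)]
    with multiprocessing.Pool(processes=32) as pool:
        pool.map(worker, chunks)

Reading a unitary back out is then just a SELECT plus np.frombuffer, which is nearly as convenient as np.loadtxt was.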

In summary, by enabling the WAL and turning syncing off, I was able to get multi-process Python code to concurrently write to a SQLite database. This also gave a nice speed bump, since SQLite is well suited to writing many small pieces of data to disk, a welcome bonus!