Using multiprocessing and sqlite3 together

Note from the author: this is a pseudo TIL, but I hadn't seen it written down anywhere, hopefully someone finds it useful! Jump to "Solution" below if you don't care about the background.

Background

Generating Data

I recently started working on a reinforcement learning project for which I needed to generate a lot of training data. The project involves quantum compilers, so the data I generate is quantum circuits. For those unfamiliar, a quantum circuit is just a sequence of unitary matrices laid out in a particular order. I chose to store each circuit as a sequence of unitary gate names. The output of the data generation is the unitaries (numpy arrays) that result from multiplying the circuit's matrices together.
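As a rough sketch of what that means (the gate set and gate names here are toy stand-ins, not the project's actual gates), computing a circuit's unitary is just a chain of matrix products:

import numpy as np

# Toy single-qubit gate set for illustration; the real project has its own gate names.
GATES = {
    "H": np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2),
    "T": np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]]),
}

def circuit_unitary(gate_names):
    """Multiply a circuit's gates together into a single unitary (numpy array)."""
    u = np.eye(2, dtype=complex)
    for name in gate_names:
        u = GATES[name] @ u  # each new gate acts after (to the left of) the previous ones
    return u

print(circuit_unitary(["H", "T", "H"]))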

I ended up wanting to generate somewhere in the region of a few hundred billion matrices, each of them very small. I knew off the bat that this would require a fair bit of time, and I wanted to take advantage of the 32-core server I own. Since I was using Python to generate this data, I used the multiprocessing module. Sadly I cannot yet take advantage of the improved Python multithreading coming in 3.12.
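The shape of the generation code was roughly the following; this is a minimal sketch, and the batch size and the random stand-in matrices are illustrative only:

import numpy as np
from multiprocessing import Pool

def generate_batch(seed):
    # Hypothetical worker: the real one builds circuits and multiplies gates;
    # random 2x2 complex matrices stand in for those products here.
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
            for _ in range(1000)]

if __name__ == "__main__":
    with Pool(processes=32) as pool:  # one worker per core on the server
        for batch in pool.imap_unordered(generate_batch, range(100)):
            pass  # saving each batch is where the trouble starts (see below)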

Disk Space Woes

For saving the generated matrices, I started off by doing the simplest thing: using plain old np.savetxt to save the (pretty tiny) matrices to disk in each process after computing the product of the matrices in the quantum circuit. This... was problematic. I quickly ran into a disk-out-of-space error. Normally that means I need to clear out space on whatever VM I am using, but there was one problem: I still had hundreds of gigabytes of space left on disk!
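For reference, that naive approach looked something like this (the directory layout and file naming are made up for illustration):

import os
import numpy as np

os.makedirs("unitaries", exist_ok=True)
u = np.eye(2, dtype=complex)  # stand-in for a computed circuit unitary
np.savetxt("unitaries/000000000.txt", u)  # one tiny text file per matrix
v = np.loadtxt("unitaries/000000000.txt", dtype=complex)  # ...and back again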

I quickly deduced that the error was actually caused by hitting the limit on the number of entries in a directory (dang it, CIFS!). I briefly tried to make my own scheme to split the unitaries across more directories to avoid this limit, but I ended up hitting more file system limits. It was clear that just writing files to disk wouldn't scale to the size of dataset I needed to generate.

Choosing a Database

Of course, when dealing with so many small files, a database was the right solution to this problem. Why didn't I start with a database to begin with? Partially because I wanted to make it easy to load individual unitaries (np.loadtxt is an incredibly handy API), and partially because I was just hacking this data generation script together.

I had one issue with switching to a database: I wanted something simple and lightweight, not anything fancy like Postgres. SQLite is the obvious choice, but by default SQLite does not support concurrent writes, which is exactly what I wanted to do!

Solution

So how can one achieve concurrent writes in Python using sqlite3? By default, SQLite uses a rollback journal to maintain consistency. You can change the configuration so that SQLite instead uses a write-ahead log (WAL), which lets readers and the writer proceed concurrently: writes are still serialized internally, but they no longer block reads, and each write holds its lock only briefly. You can enable WAL mode in Python by setting the journal mode:

import sqlite3

cursor = sqlite3.connect('circuits.db').cursor()  # hypothetical database path
cursor.execute('PRAGMA journal_mode = WAL')

However, I started getting exceptions part way through: some of the processes calculating unitaries were being told the database was locked, even though writers in WAL mode should only hold their lock briefly. Therefore I also set the SQLite pragma synchronous to OFF, which tells SQLite to stop fsync'ing and hand data straight off to the operating system, making each write (and thus each lock window) much shorter. Note this is dangerous: your database can become corrupted if the operating system crashes or the machine loses power (a crash of the Python process alone won't corrupt it). This is acceptable to me because I can always regenerate the database, and either event is very unlikely to occur while I run these data generation tasks. This can be done in Python like so:

# reusing the cursor from the previous snippet
cursor.execute('PRAGMA synchronous = OFF')

In summary, by enabling the WAL and turning fsync'ing off, I was able to get multi-process Python code to write concurrently to a SQLite database. This also gave a nice speed bump, since SQLite is well optimized for writing many small pieces of data to disk. A nice bonus!
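Putting the pieces together, here's a minimal sketch of the whole pattern. The table name, schema, database path, and random stand-in matrices are all illustrative, not the post's actual generation code:

import sqlite3
import numpy as np
from multiprocessing import Pool

DB_PATH = "circuits.db"  # hypothetical path, matching the snippets above

def writer(seed):
    # Each process opens its own connection; isolation_level=None means
    # autocommit, so each INSERT holds the write lock only briefly.
    conn = sqlite3.connect(DB_PATH, isolation_level=None)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = OFF")
    rng = np.random.default_rng(seed)
    for i in range(1000):
        # Random matrices stand in for the real circuit products.
        u = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
        conn.execute(
            "INSERT INTO unitaries (name, data) VALUES (?, ?)",
            (f"{seed}-{i}", u.tobytes()),  # store the raw array bytes as a BLOB
        )
    conn.close()

if __name__ == "__main__":
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS unitaries (name TEXT PRIMARY KEY, data BLOB)")
    conn.commit()
    conn.close()
    with Pool(processes=32) as pool:
        pool.map(writer, range(100))

To load a matrix back, np.frombuffer(data, dtype=complex).reshape(2, 2) does the job. One detail worth knowing: journal_mode = WAL is persistent (it is recorded in the database file itself), while synchronous applies per connection, which is why each worker sets both.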

Rust CLI tools apt repo

tl;dr Go to https://apt.cli.rs and follow the instructions to add the apt repo

I guess it's only fair to start my new blogging kick by catching up on a project I worked on several months ago which is pretty much "done." I really like several Rust tools, such as ripgrep, hyperfine, and fd. I wanted an easy way to install them on Debian-based systems that may not have them in the official repos, so I needed to create my own apt repo. Since I mainly use Ubuntu, I originally considered making a PPA (personal package archive), but I decided I wanted these tools available on Debian as well, so my only option was an apt repo.

It turns out that apt repos are really simple: you just need to serve a directory with, e.g., nginx. It took me a while to find a tool I liked for managing the repo, however. I started by trying reprepro, but I found it more annoying to use than I wanted, so I ended up using aptly, a newer apt repo management tool.

When setting up the repo, my criteria for inclusion were:

  • the project must be able to build *.deb packages. This is the package format for Debian/Ubuntu, so it is definitely required
  • the project must actually build and publish those packages. I wanted to ship the same official binaries as those in the GitHub release, rather than building my own
  • the binary packages must be statically linked against musl. Since various versions of Debian/Ubuntu ship different versions of glibc, you can get version compatibility errors if you don't statically link. Statically linking against musl is much easier than against glibc, so this seemed to be the best route to avoid compatibility errors.

With these criteria in hand, I set up the apt repo with the CLI tools that fit and wrote a script that tries to update all of the packages, pulling from the official GitHub releases. I have subscribed to releases on GitHub for all the repos currently included in the apt repo, so all I have to do is run the script when a new release comes out. I may end up setting up a cronjob that runs this update script, but I'm weighing whether it is worth any potential risks...

Anyway, if you are interested in using this apt repo, head on over to https://apt.cli.rs, where I've written up the commands you need to get started.

If you have suggestions for Rust CLI tools that should be included, please open an issue on the GitHub repo! If a project doesn't build musl-linked Debian packages, consider opening an issue on that project to add them. However, please do not spam maintainers asking for these packages to be built. I unfortunately do not have time to contribute to the packaging of every great Rust CLI tool, but I can add tools to the apt repo if they are packaged already!

Anyway, hopefully someone else finds it useful!