CIFS! I briefly tried to make my own schema to split the unitaries into more directories to avoid this limit, but I ended up hitting more file system limits. It was clear that just writing files to disk wouldn't scale to the size of the dataset I needed to generate.
Of course, when dealing with so many small files, a database was the right solution to this problem. Why didn't I use a database to begin with? Partially because I wanted to make it easy to load individual unitaries (`np.loadtxt` is an incredibly handy API). Also, I was just hacking this data generation script together.
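For example, loading a single unitary back from disk was a one-liner (a minimal sketch of the original one-file-per-unitary layout; the file name is hypothetical):
import numpy as np

# `unitary_00042.txt` is a made-up example of the one-file-per-unitary layout
U = np.loadtxt('unitary_00042.txt', dtype=complex)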
I had one issue with switching to a database: I wanted something simple and lightweight; I didn't need anything fancy like Postgres. SQLite is the obvious choice, but SQLite does not support concurrent writes by default, which is exactly what I wanted to do!
So how can one achieve concurrent writes in Python using sqlite3? By default, SQLite uses a rollback journal to maintain consistency. You can change the configuration so that SQLite uses write-ahead logging (WAL) instead, which lets reads and writes proceed concurrently rather than blocking one another. You can enable WAL mode in Python by setting the journal mode:
# assume some Cursor object `cursor`
cursor.execute('PRAGMA journal_mode = WAL')
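For context, here is what that looks like starting from a fresh connection (the file name unitaries.db is just a placeholder):
import sqlite3

conn = sqlite3.connect('unitaries.db')  # placeholder file name
cursor = conn.cursor()
# switch from the default rollback journal to write-ahead logging
cursor.execute('PRAGMA journal_mode = WAL')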
However, I started getting exceptions partway through: some of the processes calculating unitaries were being told the database was locked, even though it shouldn't have been (those processes should be appending to the WAL, which I want to always be available). Therefore I also set the SQLite pragma `synchronous` to `OFF`, which means SQLite hands writes off to the operating system without waiting for them to be synced to disk, even at checkpoints. Note this is dangerous: in theory your database could become corrupted if the process crashes or the server shuts down. That is acceptable to me because I can always regenerate the database, and either of those is very unlikely to occur while these data generation tasks run. This can be done in Python like so:
# assume some Cursor object `cursor`
cursor.execute('PRAGMA synchronous = OFF')
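One detail worth knowing: journal_mode = WAL is persistent (it is stored in the database file itself), but synchronous is a per-connection setting, so each worker process has to execute this pragma on its own connection.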
In summary, by enabling the WAL and turning some syncing off, I was able to get multiprocessed Python code to concurrently write to a SQLite database. This also gave a speed bump, since SQLite is optimized for writing many small pieces of data to disk. A nice bonus!
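Putting it all together, here is a minimal sketch of the whole pattern: several worker processes, each opening its own connection and setting synchronous = OFF, inserting rows into one WAL-mode database. The table name, file name, and placeholder computation are all made up for illustration:
import sqlite3
from multiprocessing import Pool

import numpy as np

DB_PATH = 'unitaries.db'  # placeholder file name

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute('PRAGMA journal_mode = WAL')  # persisted in the database file
    conn.execute('CREATE TABLE IF NOT EXISTS unitaries (id INTEGER PRIMARY KEY, data BLOB)')
    conn.commit()
    conn.close()

def write_unitary(i):
    # each worker opens its own connection and sets its own pragmas;
    # timeout=30 waits on a briefly locked database instead of raising immediately
    conn = sqlite3.connect(DB_PATH, timeout=30)
    conn.execute('PRAGMA synchronous = OFF')
    U = np.eye(4, dtype=complex)  # stand-in for the real unitary calculation
    conn.execute('INSERT INTO unitaries (id, data) VALUES (?, ?)', (i, U.tobytes()))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    init_db()
    with Pool(8) as pool:
        pool.map(write_unitary, range(1000))
Reading a unitary back is then just np.frombuffer(data, dtype=complex).reshape(4, 4).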