<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Emma's Blog - multiprocessing</title><link>https://emmatyping.dev/</link><description/><atom:link href="https://emmatyping.dev/feeds/multiprocessing/rss.xml" rel="self"/><lastBuildDate>Fri, 19 May 2023 00:00:00 -0700</lastBuildDate><item><title>Using multiprocessing and sqlite3 together</title><link>https://emmatyping.dev/using-multiprocessing-and-sqlite3-together.html</link><description>&lt;blockquote&gt;
&lt;p&gt;Note from the author: this is a pseudo TIL, but I hadn't seen it written down anywhere, hopefully someone finds it useful!
Jump to "Solution" below if you don't care about the background.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Background&lt;/h1&gt;
&lt;h3&gt;Generating Data&lt;/h3&gt;
&lt;p&gt;I recently started working on a reinforcement learning project, and I needed to generate a lot of training data. The project involves quantum compilers, so the data I generate is quantum circuits. For those unfamiliar, quantum circuits are just sequences of unitary matrices laid out in a particular order. I chose to store the circuit as a sequence of unitary gate names. The output of the data generation is the unitaries (numpy arrays) that are the result of multiplying the matrices in the circuit together.&lt;/p&gt;
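&lt;p&gt;As a rough illustration of "a sequence of gate names in, one unitary out", here is a minimal sketch. The gate table, &lt;code&gt;matmul&lt;/code&gt;, and &lt;code&gt;circuit_unitary&lt;/code&gt; are hypothetical names made up for this example, and it uses plain nested lists instead of numpy arrays purely to stay dependency-free; it is not the actual generation script.&lt;/p&gt;

```python
# Hypothetical gate table: single-qubit Pauli gates as 2x2 nested lists.
IDENTITY = [[1, 0], [0, 1]]
GATES = {
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}


def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def circuit_unitary(gate_names):
    """Return the unitary for a circuit given as a list of gate names.

    Gates are applied in list order, so each later gate multiplies
    the running product from the left.
    """
    unitary = IDENTITY
    for name in gate_names:
        unitary = matmul(GATES[name], unitary)
    return unitary


if __name__ == "__main__":
    print(circuit_unitary(["X", "Z"]))
```

&lt;p&gt;Because matrix multiplication does not commute, &lt;code&gt;["X", "Z"]&lt;/code&gt; and &lt;code&gt;["Z", "X"]&lt;/code&gt; give different unitaries, which is why the circuit has to be stored as an ordered sequence.&lt;/p&gt;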
&lt;p&gt;I ended up wanting to generate somewhere in the region of a few hundred billion matrices, each of them very small. I knew off the bat that this would require a fair bit of time, and I wanted to take advantage of the 32-core server I own. Since I was using Python to generate this data, I used the &lt;code&gt;multiprocessing&lt;/code&gt; module. Sadly I cannot yet take advantage of &lt;a href="https://martinheinz.dev/blog/97"&gt;Python multithreading coming in 3.12&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Disk Space Woes&lt;/h3&gt;
&lt;p&gt;For saving the generated matrices, I started off by doing the simplest thing: using plain old &lt;code&gt;np.savetxt&lt;/code&gt; to save the (pretty tiny) matrices to disk in each process after computing the product of the matrices in the quantum circuit. This... was problematic. I quickly ran into an out-of-disk-space error. Normally this means I need to clear out space on whatever VM I am using, but there was one problem -- I still had hundreds of gigabytes of space left on disk!&lt;/p&gt;
&lt;p&gt;I quickly deduced the error was actually caused by hitting the limit on entries in a directory, dang it &lt;a href="https://cifs.com/"&gt;CIFS&lt;/a&gt;! I briefly tried to make my own schema to split the unitaries into more directories to avoid this limit, but I ended up hitting more file system limits. It was clear that just writing files to disk wouldn't scale to the size of the dataset I needed to generate.&lt;/p&gt;
&lt;h3&gt;Choosing a Database&lt;/h3&gt;
&lt;p&gt;Of course, dealing with so many small files, a database was the right solution to this problem. Why didn't I start with a database to begin with? Partially because I wanted to make it easy to load individual unitaries (&lt;code&gt;np.loadtxt&lt;/code&gt; is an incredibly handy API). Also, I was just hacking this data generation script together.&lt;/p&gt;
&lt;p&gt;I had one issue with switching to a database: I wanted something simple and lightweight; I didn't need anything fancy like postgres. Sqlite is the obvious choice, but sqlite does not by default support concurrent &lt;em&gt;writes&lt;/em&gt;, which is exactly what I wanted to do!&lt;/p&gt;
&lt;h1&gt;Solution&lt;/h1&gt;
&lt;p&gt;So how can one achieve concurrent writes in Python using sqlite3? By default, sqlite uses a rollback journal to maintain consistency. You can change the configuration so that sqlite uses &lt;a href="https://www.sqlite.org/wal.html"&gt;a &lt;em&gt;write-ahead&lt;/em&gt; log (WAL) mode&lt;/a&gt; instead, which lets readers and a writer work at the same time and makes it practical for several processes to write to one database (sqlite still applies the writes one at a time under the hood). You can enable WAL mode in Python by setting the &lt;a href="https://www.sqlite.org/pragma.html#pragma_journal_mode"&gt;journal mode&lt;/a&gt;:&lt;/p&gt;
&lt;div class="codehilite" style="background: #0d1117"&gt;&lt;pre style="line-height: 125%;"&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span style="color: #8b949e; font-style: italic"&gt;# assume some Cursor object `cursor`&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;cursor&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;.&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;execute(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;&amp;#39;PRAGMA journal_mode = WAL&amp;#39;&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
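&lt;p&gt;To check that the pragma actually took effect, you can read back the row it returns. The following is a minimal, self-contained sketch; the helper name &lt;code&gt;enable_wal&lt;/code&gt;, the file name &lt;code&gt;demo.db&lt;/code&gt;, and the &lt;code&gt;demo&lt;/code&gt; table are assumptions made for the example.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile


def enable_wal(db_path):
    """Switch the database at db_path to WAL mode and return the mode in effect."""
    conn = sqlite3.connect(db_path)
    try:
        # The pragma returns a single row holding the resulting journal mode.
        mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
        # Write something so the database file really exists on disk.
        conn.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER PRIMARY KEY)")
        conn.commit()
    finally:
        conn.close()
    return mode


# WAL needs a real file; an in-memory database cannot use it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    first_mode = enable_wal(path)

    # WAL mode is stored in the database file, so a fresh connection sees it too.
    conn = sqlite3.connect(path)
    persisted_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    conn.close()

print(first_mode, persisted_mode)
```

&lt;p&gt;Since the journal mode persists in the file, it is enough to set it once when the database is created, although setting it again from every process is harmless.&lt;/p&gt;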

&lt;p&gt;However, I started getting exceptions partway through. Some processes calculating the unitaries were being told the database was locked, even though I did not expect it to be (these processes should be writing to the WAL, which I want to always be available). Therefore I also set &lt;a href="https://www.sqlite.org/pragma.html#pragma_synchronous"&gt;the sqlite pragma &lt;code&gt;synchronous&lt;/code&gt;&lt;/a&gt; to &lt;code&gt;OFF&lt;/code&gt;, which means sqlite hands writes to the operating system and carries on without waiting for them to be synced to disk. Note this is &lt;strong&gt;dangerous&lt;/strong&gt; because your database could become corrupted if the operating system crashes or the server loses power before the data reaches the disk. This is acceptable to me because I can always regenerate the database, and either of these is very unlikely to occur while I run these data generation tasks. This can be done in Python like so:&lt;/p&gt;
&lt;div class="codehilite" style="background: #0d1117"&gt;&lt;pre style="line-height: 125%;"&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span style="color: #8b949e; font-style: italic"&gt;# assume some Cursor object `cursor`&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;cursor&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;.&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;execute(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;&amp;#39;PRAGMA synchronous = OFF&amp;#39;&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
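&lt;p&gt;Putting both pragmas together, a multi-process writer could look roughly like the sketch below. Everything named here is an assumption for illustration: the &lt;code&gt;results&lt;/code&gt; table, the helpers &lt;code&gt;open_connection&lt;/code&gt;, &lt;code&gt;worker&lt;/code&gt;, and &lt;code&gt;run_demo&lt;/code&gt;, and the placeholder payload strings. Two things go beyond what is described above: the &lt;code&gt;timeout&lt;/code&gt; argument to &lt;code&gt;sqlite3.connect&lt;/code&gt;, which makes a connection wait for a lock instead of failing right away, and the explicit &lt;code&gt;fork&lt;/code&gt; start method, which assumes a POSIX system.&lt;/p&gt;

```python
import multiprocessing as mp
import os
import sqlite3
import tempfile


def open_connection(db_path):
    """Open a connection configured as in the post (plus an assumed timeout)."""
    # timeout is an addition for this sketch: wait up to 30 s for a lock.
    conn = sqlite3.connect(db_path, timeout=30.0)
    conn.execute("PRAGMA journal_mode = WAL")
    # synchronous is a per-connection setting, so every process sets it itself.
    conn.execute("PRAGMA synchronous = OFF")
    return conn


def worker(db_path, worker_id, n_rows):
    """Insert n_rows placeholder rows using this process's own connection."""
    conn = open_connection(db_path)
    try:
        for i in range(n_rows):
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO results (worker_id, payload) VALUES (?, ?)",
                    (worker_id, f"circuit-{worker_id}-{i}"),
                )
    finally:
        conn.close()


def run_demo(n_workers=4, n_rows=50):
    """Run the workers against a temporary database; return (row count, exit codes)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "unitaries.db")

        # Create the schema first, and close the connection before forking.
        conn = open_connection(db_path)
        conn.execute("CREATE TABLE results (worker_id INTEGER, payload TEXT)")
        conn.commit()
        conn.close()

        # Assumption: POSIX "fork". On Windows, use the default context and
        # start the processes under an `if __name__ == "__main__":` guard.
        ctx = mp.get_context("fork")
        procs = [
            ctx.Process(target=worker, args=(db_path, w, n_rows))
            for w in range(n_workers)
        ]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()

        conn = sqlite3.connect(db_path)
        (count,) = conn.execute("SELECT COUNT(*) FROM results").fetchone()
        conn.close()
        return count, [proc.exitcode for proc in procs]


if __name__ == "__main__":
    print(run_demo())
```

&lt;p&gt;Each process opens its own connection rather than inheriting one from the parent, since sqlite connections should not be carried across a fork. Committing after every row keeps the sketch simple; batching many inserts into one transaction would likely be faster for a real workload.&lt;/p&gt;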

&lt;p&gt;In summary, by enabling WAL mode and turning syncing off, I was able to get multiprocess Python code to write to a single sqlite database concurrently. This also gave a nice speed bump, since sqlite is optimized for writing many small pieces of data to disk!&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Emma Smith</dc:creator><pubDate>Fri, 19 May 2023 00:00:00 -0700</pubDate><guid>tag:emmatyping.dev,2023-05-19:/using-multiprocessing-and-sqlite3-together.html</guid><category>misc</category><category>sql</category><category>python</category><category>multiprocessing</category></item></channel></rss>